Abstract 

Prior design is one of the most important problems in both statistics and machine 
learning. Cross validation (CV) and the widely applicable information criterion 
(WAIC) are predictive measures of Bayesian estimation. However, it has been 
difficult to apply them to find the optimal prior, because their mathematical properties 
in prior evaluation have been unknown and the region of the hyperparameters is too 
wide to be examined. In this paper, we derive a new formula by which the theoretical 
relation among CV, WAIC, and the generalization loss is clarified and the optimal 
hyperparameter can be directly found. 

By the formula, three facts are clarified about predictive prior design. Firstly, CV 
and WAIC have the same second order asymptotic expansion, hence they are 
asymptotically equivalent to each other as the optimizer of the hyperparameter. Secondly, 
the hyperparameter which minimizes CV or WAIC asymptotically minimizes the average 
generalization loss, but it does not minimize the random generalization loss. 
Lastly, by using the mathematical relation between priors, the variances of the 
hyperparameters optimized by CV and WAIC are made smaller with small computational 
costs. We also show that the hyperparameter optimized by DIC or the marginal 
likelihood does not minimize the average or random generalization loss in general. 

Keywords. Hyperparameter, Cross validation, WAIC, DIC, marginal likelihood 


1 Introduction 


In statistics and machine learning, how to design a prior distribution is one of the 
most important problems. It is well known that Bayesian estimation and some regularization 
techniques are useful in practical statistical problems. However, their performance strongly 
depends on the prior, hence we need a theoretical foundation which enables us to evaluate 
the chosen prior. 

Sometimes the parameter in the prior is called a hyperparameter, and the prior design 
problem reduces to the problem of how to choose the optimal hyperparameter. Historically, it 
was proposed that a hyperparameter be optimized by maximization of the marginal 
likelihood [Good, 1952; Akaike, 1980]. This method is one of the rational procedures because 
it can be understood as the maximum likelihood method for the marginal distribution. 
However, the optimal prior for this criterion does not minimize the average generalization 
loss in general. In this paper, we study predictive prior design, in other words, a method 
for choosing a prior so as to minimize the average generalization loss. 

The generalization loss can be estimated by cross validation (CV) and information 
criteria. In Bayesian estimation, the leave-one-out cross validation can be approximated by 
the importance sampling cross validation [Gelfand et al., 1992; Vehtari and Lampinen, 
2002], whose statistical property was studied in Bayesian statistics [Peruggia, 1997; 
Epifani et al., 2008; Vehtari and Ojanen, 2012]. The deviance information criterion (DIC) 
was proposed for the case when the true distribution is realizable by a statistical model and 
the posterior is a normal distribution [Spiegelhalter et al., 2002, 2014]. For general cases 
when the true distribution may be unrealizable or the posterior may not be a normal 
distribution, the widely applicable information criterion (WAIC) was proposed based on 
singular learning theory [Watanabe, 2001, 2009, 2010b], and it was proved that WAIC is 
asymptotically equivalent to the leave-one-out cross validation [Watanabe, 2010a]. Both CV 
and WAIC are studied by using the Hamiltonian Monte Carlo method and its improved 
algorithm using No-U-Turn dynamics [Gelman et al., 2013, 2014; Vehtari and Ojanen, 2012]. 

From the predictive point of view, both CV and information criteria have three 
problems. The first is a theoretical problem about consistency. Both CV and information 
criteria are asymptotically unbiased estimators of the average generalization loss, in other 
words, their expectation values are asymptotically equal to that of the generalization loss. 
However, in general, the minimum point of a random function is not equal to that of 
the average function. Therefore, it has been unknown whether minimization of CV or 
information criteria is asymptotically equivalent to minimization of the average generalization 
loss or not. In this paper, we prove that, if a statistical model is regular, minimization 
of CV or WAIC asymptotically minimizes the average generalization loss, 
whereas minimization of DIC or maximization of the marginal likelihood does not, even 
asymptotically. 

The second is a problem about the difference between the random and average 
generalization losses. The former is the random variable which depends on a given set of training 
samples, whereas the latter is the expectation value over all training sets taken from the 
true distribution. We show in this paper that the hyperparameter that minimizes the 
random generalization loss does not converge to that of the average generalization loss. It 
follows that, although the optimal hyperparameter for the minimum CV or WAIC 
asymptotically minimizes the average generalization loss, it does not minimize the random 
generalization loss. 

The last is a practical problem. When CV or an information criterion is employed, it is 
not easy to determine the region of candidate hyperparameters, because we do not know 
whether the optimal hyperparameter exists, or, if it does, where it is. Moreover, after 
the region is determined, in order to compare CV or information criteria the posterior 
distributions are required for all candidate hyperparameters, resulting in heavy 
computational costs. The new formula obtained in this paper enables us to directly estimate CV 
and WAIC as a function of a hyperparameter, hence the optimal hyperparameter can be 
found without comparing candidate hyperparameters. We also show a method by which 
the variances of the hyperparameters chosen by CV and WAIC are made smaller using the 
new formula. 

This paper consists of eight sections. In the second and third sections, we give the 
basic definitions in Bayesian statistical learning and introduce the main results of this 
paper. In the fourth section, we study two examples. The fifth and sixth sections are 
devoted to the proofs of the basic lemmas and the main theorems respectively. In the 
seventh section, several points of the main results are discussed, and in the last section, 
we conclude the paper with problems for future study. 

2 Basic Definitions in Bayesian Statistical Learning 

In this section, we introduce the basic definitions in Bayesian statistical learning. Let 
$q(x)$ be a probability density function on the $N$ dimensional real Euclidean space $\mathbb{R}^N$ and 
$X_1, X_2, \ldots, X_n$ be random variables on $\mathbb{R}^N$ which are independently subject to $q(x)$. The 
probability density function $q(x)$ is sometimes referred to as a true distribution. A training 
set is denoted by $X^n = (X_1, X_2, \ldots, X_n)$, where $n$ is the number of training samples. The 
average $\mathbb{E}[\ ]$ shows the expectation value over all training sets $X^n$. A statistical model or 
a learning machine is defined by $p(x|w)$, which is a probability density function of $x \in \mathbb{R}^N$ 
for a given parameter $w \in W \subset \mathbb{R}^d$. A nonnegative function $\varphi(w)$ on the parameter set 
$W$ is called a prior distribution. If it satisfies 

$$ \int \varphi(w)\,dw = 1, \tag{1} $$

then $\varphi(w)$ is called proper. In this paper, we study both proper and improper priors, 
hence eq. (1) does not hold in general. The posterior distribution is defined by 

$$ p(w|X^n) = \frac{1}{Z(\varphi)}\,\varphi(w)\prod_{i=1}^{n} p(X_i|w), \tag{2} $$

where $Z(\varphi)$ is a normalizing constant, 

$$ Z(\varphi) = \int \varphi(w)\prod_{i=1}^{n} p(X_i|w)\,dw. $$

We assume that $Z(\varphi)$ is finite with probability one. If $\varphi(w)$ is proper, then $Z(\varphi)$ is equal to 
the marginal likelihood. The expectation value of a given function $f(w)$ over the posterior 
distribution is denoted by 

$$ \mathbb{E}_\varphi[f(w)] = \int f(w)\,p(w|X^n)\,dw. \tag{3} $$

The predictive distribution is the average of a statistical model over the posterior 
distribution, 

$$ p(x|X^n) = \mathbb{E}_\varphi[p(x|w)]. \tag{4} $$

The random generalization loss is defined by 

$$ G(\varphi) = -\int q(x)\log p(x|X^n)\,dx. \tag{5} $$

Note that the random variable $G(\varphi)$ depends on the training set $X^n$. The average 
generalization loss is defined by $\mathbb{E}[G(\varphi)]$. In this paper, we show that the random variable 
$G(\varphi)$ has a different behavior from its expectation value $\mathbb{E}[G(\varphi)]$ as a functional of $\varphi$, 
even asymptotically. The Bayesian leave-one-out cross validation (CV) is defined by 

$$ \mathrm{CV}(\varphi) = -\frac{1}{n}\sum_{i=1}^{n}\log p(X_i|X^n\setminus X_i) \tag{6} $$
$$ \phantom{\mathrm{CV}(\varphi)} = \frac{1}{n}\sum_{i=1}^{n}\log \mathbb{E}_\varphi\!\left[\frac{1}{p(X_i|w)}\right], \tag{7} $$

where $X^n\setminus X_i$ is the set of training samples leaving $X_i$ out. A calculation method of CV 
by eq. (7) using the posterior distribution by the Markov chain Monte Carlo method is 
sometimes called the importance sampling cross validation. The training error $T(\varphi)$ and 
the functional variance $V(\varphi)$ are respectively defined by 


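$$ T(\varphi) = -\frac{1}{n}\sum_{i=1}^{n}\log p(X_i|X^n) = -\frac{1}{n}\sum_{i=1}^{n}\log \mathbb{E}_\varphi[p(X_i|w)], \tag{8} $$
$$ V(\varphi) = \sum_{i=1}^{n}\Big\{\mathbb{E}_\varphi[(\log p(X_i|w))^2] - \mathbb{E}_\varphi[\log p(X_i|w)]^2\Big\}. \tag{9} $$

Then the widely applicable information criterion (WAIC) is defined by 

$$ \mathrm{WAIC}(\varphi) = T(\varphi) + \frac{1}{n}V(\varphi). \tag{10} $$

For a real number $\alpha$, the functional cumulant function is defined by 

$$ F_{cum}(\alpha) = \frac{1}{n}\sum_{i=1}^{n}\log \mathbb{E}_\varphi[p(X_i|w)^{\alpha}]. \tag{11} $$

Then, as is shown in [Watanabe, 2010a], 

$$ \mathrm{CV}(\varphi) = F_{cum}(-1), \tag{12} $$
$$ T(\varphi) = -F_{cum}(1), \tag{13} $$
$$ V(\varphi) = n\,F_{cum}''(0), \tag{14} $$
$$ \mathrm{WAIC}(\varphi) = -F_{cum}(1) + F_{cum}''(0). \tag{15} $$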
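The definitions in eqs. (7) to (15) can be evaluated directly from posterior draws. The following is a minimal sketch, not the paper's experimental code: it assumes a toy model $p(x|w) = \mathcal{N}(w, 1)$ with the flat prior $\varphi(w) = 1$, for which the posterior is exactly $\mathcal{N}(\bar{x}, 1/n)$, so no MCMC is needed. The names `f_cum`, `log_mean_exp`, and `cv_exact` are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, S = 50, 20000                      # sample size and number of posterior draws
x = rng.normal(0.0, 1.0, size=n)      # training set X^n

# Assumed toy setting: p(x|w) = N(w, 1), flat prior, so the posterior is N(mean(x), 1/n).
w = rng.normal(x.mean(), 1.0 / np.sqrt(n), size=S)

# loglik[s, i] = log p(X_i | w_s)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - w[:, None]) ** 2


def log_mean_exp(a):
    """Numerically stable log of the column-wise mean of exp(a)."""
    m = a.max(axis=0)
    return m + np.log(np.mean(np.exp(a - m), axis=0))


def f_cum(alpha):
    """Functional cumulant function of eq. (11), estimated from the draws."""
    return np.mean(log_mean_exp(alpha * loglik))


cv = f_cum(-1.0)                               # eq. (12), importance sampling CV
train = -f_cum(1.0)                            # eq. (13), training error T
v = np.sum(np.var(loglik, axis=0))             # eq. (9), functional variance V
waic = train + v / n                           # eq. (10)

# Exact leave-one-out loss for this conjugate toy model, used only as a check:
# the predictive density without X_i is N(mean of the others, 1 + 1/(n-1)).
loo_mean = (x.sum() - x) / (n - 1)
loo_var = 1.0 + 1.0 / (n - 1)
cv_exact = np.mean(0.5 * np.log(2 * np.pi * loo_var)
                   + (x - loo_mean) ** 2 / (2 * loo_var))

print(cv, waic, cv_exact)
```

In this sketch the three printed values should nearly coincide, which illustrates the asymptotic equivalence of CV and WAIC for a fixed prior.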


In the previous papers, we proved by singular learning theory that, even if a true 
distribution is unrealizable by a statistical model or even if the posterior distribution is not the 
normal distribution, 

$$ \mathbb{E}[\mathrm{CV}(\varphi)] = \mathbb{E}[G(\varphi)] + O\!\left(\frac{1}{n^2}\right), $$
$$ \mathbb{E}[\mathrm{WAIC}(\varphi)] = \mathbb{E}[G(\varphi)] + O\!\left(\frac{1}{n^2}\right), $$
$$ \mathrm{WAIC}(\varphi) = \mathrm{CV}(\varphi) + O_p\!\left(\frac{1}{n^3}\right). $$

However, it has been left unknown whether minimization of $\mathrm{CV}(\varphi)$ or $\mathrm{WAIC}(\varphi)$ with 
respect to a prior $\varphi$ is asymptotically equivalent to minimization of $G(\varphi)$ and $\mathbb{E}[G(\varphi)]$ or 
not. In this paper, we prove in Theorem 1 that, if a statistical model satisfies several 
regularity conditions, minimization of $\mathrm{CV}(\varphi)$ or $\mathrm{WAIC}(\varphi)$ is asymptotically equivalent to 
minimization of $\mathbb{E}[G(\varphi)]$ but not to that of $G(\varphi)$. 

In the hyperparameter optimization problem, two alternative methods are well known. 
The former is maximization of the marginal likelihood, or equivalently minimization of the 
free energy, that is, the minus log marginal likelihood, 

$$ F_{free}(\varphi) = -\log\int \varphi(w)\prod_{i=1}^{n}p(X_i|w)\,dw + \log\int\varphi(w)\,dw. \tag{16} $$

In order to use this method, the integral of $\varphi(w)$ should be finite, because, if it is not 
finite, $F_{free}(\varphi)$ cannot be defined. The latter is the deviance information criterion (DIC), 

$$ \mathrm{DIC}(\varphi) = \frac{1}{n}\sum_{i=1}^{n}\Big\{-2\,\mathbb{E}_\varphi[\log p(X_i|w)] + \log p(X_i|\mathbb{E}_\varphi[w])\Big\}. \tag{17} $$

In this method a prior may be improper, as in CV and WAIC. In this paper, we show 
that the hyperparameter which minimizes $F_{free}(\varphi)$ or $\mathrm{DIC}(\varphi)$ does not minimize either 
$\mathbb{E}[G(\varphi)]$ or $G(\varphi)$ even asymptotically in general. 


3 Main Results 


3.1 Definitions and Conditions 


In this section, we introduce several notations, regularity conditions, and definitions of 
mathematical relations between priors. 

The set of parameters $W$ is assumed to be an open subset of $\mathbb{R}^d$. In this paper, $\varphi_0(w)$ 
is an arbitrary fixed prior and $\varphi(w)$ is a candidate prior which will be optimized. We 
assume that, for an arbitrary $w \in W$, $\varphi_0(w) > 0$ and $\varphi(w) > 0$. We do not assume that 
they are proper. The main purpose of this paper is to make a new formula which enables us 
to directly estimate $\mathrm{CV}(\varphi) - \mathrm{CV}(\varphi_0)$ and $\mathrm{WAIC}(\varphi) - \mathrm{WAIC}(\varphi_0)$. 

The prior ratio function $\phi(w)$ is denoted by 

$$ \phi(w) = \frac{\varphi(w)}{\varphi_0(w)}. $$

If $\varphi_0(w) \equiv 1$, then $\phi(w) = \varphi(w)$. The empirical log loss function and the maximum a 
posteriori (MAP) estimator $\hat{w}$ are respectively defined by 

$$ L(w) = -\frac{1}{n}\sum_{i=1}^{n}\log p(X_i|w) - \frac{1}{n}\log\varphi_0(w), \tag{18} $$
$$ \hat{w} = \arg\min_{w\in W} L(w), \tag{19} $$

where neither $L(w)$ nor $\hat{w}$ depends on $\varphi(w)$. If $\varphi_0(w) \equiv 1$, then $\hat{w}$ is equal to the 
maximum likelihood estimator (MLE). The average log loss function and the parameter 
that minimizes it are respectively defined by 

$$ \mathcal{L}(w) = -\int q(x)\log p(x|w)\,dx, \tag{20} $$
$$ w_0 = \arg\min_{w\in W}\mathcal{L}(w). \tag{21} $$



In this paper we use the following notations for simple description. 

Notations. 

(1) A parameter is denoted by $w = (w^1, w^2, \ldots, w^d) \in \mathbb{R}^d$. Remark that $w^k$ means 
the $k$th element of $w$, which does not mean $w$ to the power of $k$. 

(2) For a given real function $f(w)$ and nonnegative integers $k_1, k_2, \ldots, k_m$, we define 

$$ f_{k_1k_2\cdots k_m} = f_{k_1k_2\cdots k_m}(w) = \frac{\partial^m f}{\partial w^{k_1}\partial w^{k_2}\cdots\partial w^{k_m}}(w). \tag{22} $$

(3) We adopt Einstein's summation convention and $k_1, k_2, k_3, \ldots$ are used for such suffices. 
For example, 

$$ X_{k_1k_2}Y^{k_2k_3} = \sum_{k_2=1}^{d} X_{k_1k_2}Y^{k_2k_3}. $$

In other words, if a suffix $k_i$ appears both upper and lower, it means automatic summation 
over $k_i = 1, 2, \ldots, d$. In this paper, for each $k_1, k_2$, $X^{k_1k_2} = X^{k_1}_{k_2} = X_{k_1k_2}$. 

In order to prove the main theorem, we need the regularity conditions. In this paper, 
we do not study singular learning machines. 


Regularity Conditions. 

(1) (Parameter Set) The parameter set $W$ is an open set in $\mathbb{R}^d$. 

(2) (Smoothness of Models) The functions $\log\varphi(w)$, $\log\varphi_0(w)$, and $\log p(x|w)$ are $C^\infty$-class 
functions of $w \in W$, in other words, they are infinitely many times differentiable. 

(3) (Identifiability of Parameter) There exists a unique $w_0 \in W$ which minimizes the 
average log loss function $\mathcal{L}(w)$. There exists a unique $\hat{w} \in W$ which minimizes $L(w)$ with 
probability one. It is assumed that the convergence in probability $\hat{w} \to w_0$ $(n \to \infty)$ holds. 

(4) (Regularity Condition) The matrix $\mathcal{L}_{k_1k_2}(w_0)$ is invertible. Also the matrix $L_{k_1k_2}(w)$ 
is invertible for almost all $w$ in a neighborhood of $w_0$ with probability one. Let $J^{k_1k_2}(w)$ 
be the inverse matrix of $L_{k_1k_2}(w)$. 

(5) (Well-Definedness and Concentration of Posterior) We assume that, for an 
arbitrary $|\alpha| \le 1$ and $j = 1, 2, \ldots, n+1$, 

$$ \mathbb{E}_{X_{n+1}}\mathbb{E}\Big[\,\big|\log\mathbb{E}_\varphi[p(X_j|w)^{\alpha}]\big|\,\Big] < \infty. \tag{23} $$

The same inequality as eq. (23) holds for $\varphi_0(w)$ instead of $\varphi(w)$. Let $Q(X^n, w)$ be an 
arbitrary finite product of 


{logip{w))kik2-kp, 

(log ipo{w))kik 2 -kg, 

1 " 

i=l 

iJ^^^Hw))k,k2-k., 


where |a| < 1, p,q,r, s > 0 and ]()[ shows a finite product of a combination {ki,k 2 , ..,kr). 
Let 


IF(e) = {tc E IF; [re — 'u)| < 


(24) 
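It is assumed that there exists $\epsilon > 0$ such that, for an arbitrary such product $Q(X^n, w)$, 

$$ \mathbb{E}\Big[\sup_{w\in W(\epsilon)}|Q(X^n,w)|\Big] < \infty, \tag{25} $$
$$ \mathbb{E}[Q(X^n,\hat{w})] \to \mathbb{E}[Q(X^n,w_0)], \tag{26} $$

and that, for arbitrary $|\alpha| \le 1$ and $\beta > 0$, 

$$ \frac{\mathbb{E}_\varphi[Q(X^n,w)\,p(X_j|w)^{\alpha}]}{\mathbb{E}_\varphi[p(X_j|w)^{\alpha}]} = \frac{\displaystyle\int_{W(\epsilon)} Q(X^n,w)\,p(X_j|w)^{\alpha}\prod_{i=1}^{n}p(X_i|w)\,\varphi(w)\,dw}{\displaystyle\int_{W(\epsilon)} p(X_j|w)^{\alpha}\prod_{i=1}^{n}p(X_i|w)\,\varphi(w)\,dw}\,\Big(1 + o_p\big(1/n^{\beta}\big)\Big), \tag{27} $$

where $o_p(1/n^{\beta})$ satisfies $n^{\beta}\,\mathbb{E}[|o_p(1/n^{\beta})|] \to 0$. Also we assume that the same equation as 
eq. (27) holds for $\varphi_0(w)$ instead of $\varphi(w)$. 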


Explanation of Regularity Condition. (1) In this paper, we assume that $p(x|w)$ is 
regular at $w_0$, that is to say, the second order matrix $\mathcal{L}_{k_1k_2}(w_0)$ is positive definite. If this 
condition is not satisfied, then such $(q(x), p(x|w))$ is called singular. The results of this 
paper do not hold for singular learning machines. 

(2) Conditions eq. (25) and eq. (26) ensure the finiteness of the expectation values and 
concentration of the posterior distribution. The condition of the concentration, eq. (27), 
is set by the following mathematical reason. Let $S(w)$ be a function which takes the 
minimum value $S(\hat{w}) = 0$ at $w = \hat{w}$. If $S_{k_1k_2}(\hat{w})$ is positive definite, then by using the 
saddle point approximation in the neighborhood of $\hat{w}$, 

$$ \exp(-nS(w)) \approx \exp\Big(-\frac{n}{2}\,S_{k_1k_2}(\hat{w})\,(w-\hat{w})^{k_1}(w-\hat{w})^{k_2}\Big), $$


hence the orders of integrations inside and outside of W (e) are respectively given by 

f exp{—nS{w))dw = 

Jw(e) 


I 

Jv\ 


W\W{e) 


exp{—nS{w))dw = 0(exp(—n^)). 


Therefore the integration over $W \setminus W(\epsilon)$ converges to zero faster than that over $W(\epsilon)$ as 
$n \to \infty$. 


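Definition. (Empirical Mathematical Relations between Priors) The empirical 
mathematical relation between two priors $\varphi(w)$ and $\varphi_0(w)$ at a parameter $w$ is defined by 

$$ M(\phi, w) = A^{k_1k_2}(w)(\log\phi)_{k_1}(\log\phi)_{k_2} + B^{k_1k_2}(w)(\log\phi)_{k_1k_2} + C^{k_1}(w)(\log\phi)_{k_1}, \tag{28} $$

where $\phi(w) = \varphi(w)/\varphi_0(w)$ and 

$$ J^{k_1k_2}(w) = \text{inverse matrix of } L_{k_1k_2}(w), $$
$$ A^{k_1k_2}(w) = \frac{1}{2}J^{k_1k_2}(w), $$
$$ B^{k_1k_2}(w) = \frac{1}{2}\Big(J^{k_1k_2}(w) + J^{k_1k_3}(w)J^{k_2k_4}(w)F_{k_3,k_4}(w)\Big), $$
$$ C^{k_1}(w) = J^{k_1k_2}(w)J^{k_3k_4}(w)F_{k_2k_4,k_3}(w) - \frac{1}{2}J^{k_1k_2}(w)J^{k_3k_4}(w)L_{k_2k_3k_4}(w) - \frac{1}{2}J^{k_1k_2}(w)J^{k_3k_4}(w)J^{k_5k_6}(w)L_{k_2k_3k_5}(w)F_{k_4,k_6}(w), $$

where $L_{k_1k_2}(w)$ and $L_{k_1k_2k_3}(w)$ are the second and third derivatives of $L(w)$ respectively 
as defined by eq. (22) and 

$$ F_{k_1,k_2}(w) = \frac{1}{n}\sum_{i=1}^{n}(\log p(X_i|w))_{k_1}(\log p(X_i|w))_{k_2}, $$
$$ F_{k_1k_2,k_3}(w) = \frac{1}{n}\sum_{i=1}^{n}(\log p(X_i|w))_{k_1k_2}(\log p(X_i|w))_{k_3}. $$

Remark. Note that neither $A^{k_1k_2}(w)$, $B^{k_1k_2}(w)$, nor $C^{k_1}(w)$ depends on a candidate prior 
$\varphi(w)$. 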
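The contraction in eq. (28) is easy to get wrong by hand, so the following sketch evaluates it with `numpy.einsum` for generic inputs. It is an illustration under our own conventions, not code from the paper: the function name `relation_M` and the array layouts (`F2[a, b, c]` standing for $F_{ab,c}$, `L3[a, b, c]` for $L_{abc}$) are assumptions. As a check it uses the normal model of Section 4.1 with precision parameter $s$ and compares against the closed form stated there.

```python
import numpy as np


def relation_M(J, L3, F1, F2, g1, g2):
    """Empirical mathematical relation M(phi, w) of eq. (28).

    J  : (d, d)    inverse of the Hessian L_{k1k2}
    L3 : (d, d, d) third derivatives L_{k1k2k3}
    F1 : (d, d)    F_{k1,k2}
    F2 : (d, d, d) F_{k1k2,k3}, stored as F2[k1, k2, k3]
    g1 : (d,)      first derivatives of log(prior ratio)
    g2 : (d, d)    second derivatives of log(prior ratio)
    """
    A = 0.5 * J
    B = 0.5 * (J + np.einsum("ac,bd,cd->ab", J, J, F1))
    C = (np.einsum("ab,cd,bdc->a", J, J, F2)
         - 0.5 * np.einsum("ab,cd,bcd->a", J, J, L3)
         - 0.5 * np.einsum("ab,cd,ef,bce,df->a", J, J, J, L3, F1))
    return g1 @ A @ g1 + np.sum(B * g2) + C @ g1


# Check on the normal model p(x|m,s) with mean m and precision s, fixed prior 1.
rng = np.random.default_rng(1)
x = rng.normal(1.0, 1.0, size=25)
n = x.size
m = x.mean()
r = x - m
M2, M3, M4 = (np.mean(r ** j) for j in (2, 3, 4))
s = 1.0 / M2                                   # MLE of the precision

# Per-sample derivatives of log p(x|m,s) at the MLE, parameter order (m, s).
grad = np.stack([s * r, 0.5 / s - 0.5 * r ** 2], axis=1)
hess = np.empty((n, 2, 2))
hess[:, 0, 0] = -s
hess[:, 0, 1] = hess[:, 1, 0] = r
hess[:, 1, 1] = -0.5 / s ** 2

J = np.linalg.inv(-hess.mean(axis=0))
L3 = np.zeros((2, 2, 2))                       # third derivatives of -log p
L3[0, 0, 1] = L3[0, 1, 0] = L3[1, 0, 0] = 1.0
L3[1, 1, 1] = -1.0 / s ** 3
F1 = np.einsum("ia,ib->ab", grad, grad) / n
F2 = np.einsum("iab,ic->abc", hess, grad) / n

# Candidate prior s^mu * exp(-(s/2)(lam*m^2 + eps)); its log derivatives.
lam, mu, eps = 0.01, -1.0, 0.01
g1 = np.array([-lam * s * m, mu / s - 0.5 * (lam * m ** 2 + eps)])
g2 = np.array([[-lam * s, -lam * m], [-lam * m, -mu / s ** 2]])

M_general = relation_M(J, L3, F1, F2, g1, g2)
M_closed = (0.5 * lam ** 2 * s * m ** 2
            + (mu - lam * s * m ** 2 / 2 - eps * s / 2) ** 2
            + (mu / 2 - lam * s * m ** 2 / 2 - eps * s / 2) * (1 + s ** 2 * M4)
            - lam + lam * m * s ** 2 * M3)
print(M_general, M_closed)
```

Only derivatives of the model and of the log prior ratio at a single point are needed, which is why no posterior sampling appears in the sketch.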


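Definition. (Average Mathematical Relations between Priors) The average mathematical 
relation $\mathcal{M}(\phi, w)$ is defined by the same manner as eq. (28) by using 

$$ \mathcal{J}^{k_1k_2}(w) = \text{inverse matrix of } \mathcal{L}_{k_1k_2}(w), \tag{29} $$
$$ \mathcal{L}_{k_1k_2}(w) = \int(-\log p(x|w))_{k_1k_2}\,q(x)\,dx, \tag{30} $$
$$ \mathcal{L}_{k_1k_2k_3}(w) = \int(-\log p(x|w))_{k_1k_2k_3}\,q(x)\,dx, \tag{31} $$
$$ \mathcal{F}_{k_1,k_2}(w) = \int(\log p(x|w))_{k_1}(\log p(x|w))_{k_2}\,q(x)\,dx, \tag{32} $$
$$ \mathcal{F}_{k_1k_2,k_3}(w) = \int(\log p(x|w))_{k_1k_2}(\log p(x|w))_{k_3}\,q(x)\,dx, \tag{33} $$

instead of $J^{k_1k_2}(w)$, $L_{k_1k_2}(w)$, $L_{k_1k_2k_3}(w)$, $F_{k_1,k_2}(w)$, and $F_{k_1k_2,k_3}(w)$ respectively. 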

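The self-average mathematical relation $\langle M\rangle(\phi, w)$ is defined by the same manner as 
$M(\phi, w)$ by using 

$$ \langle J^{k_1k_2}\rangle(w) = \text{inverse matrix of } \langle L_{k_1k_2}\rangle(w), \tag{34} $$
$$ \langle L_{k_1k_2}\rangle(w) = \int(-\log p(x|w))_{k_1k_2}\,p(x|w)\,dx, \tag{35} $$
$$ \langle L_{k_1k_2k_3}\rangle(w) = \int(-\log p(x|w))_{k_1k_2k_3}\,p(x|w)\,dx, \tag{36} $$
$$ \langle F_{k_1,k_2}\rangle(w) = \int(\log p(x|w))_{k_1}(\log p(x|w))_{k_2}\,p(x|w)\,dx, \tag{37} $$
$$ \langle F_{k_1k_2,k_3}\rangle(w) = \int(\log p(x|w))_{k_1k_2}(\log p(x|w))_{k_3}\,p(x|w)\,dx, \tag{38} $$

instead of $J^{k_1k_2}(w)$, $L_{k_1k_2}(w)$, $L_{k_1k_2k_3}(w)$, $F_{k_1,k_2}(w)$, and $F_{k_1k_2,k_3}(w)$ respectively. 

Remark. In the self-average case, it holds that $\langle L_{k_1k_2}\rangle(w) = \langle F_{k_1,k_2}\rangle(w)$, hence $\langle M\rangle(\phi, w)$ 
can be calculated by the same manner as eq. (28) by using 

$$ \langle A^{k_1k_2}\rangle(w) = \frac{1}{2}\langle J^{k_1k_2}\rangle(w), $$
$$ \langle B^{k_1k_2}\rangle(w) = \langle J^{k_1k_2}\rangle(w), $$
$$ \langle C^{k_1}\rangle(w) = \langle J^{k_1k_2}\rangle(w)\langle J^{k_3k_4}\rangle(w)\langle F_{k_2k_4,k_3}\rangle(w) - \langle J^{k_1k_2}\rangle(w)\langle J^{k_3k_4}\rangle(w)\langle L_{k_2k_3k_4}\rangle(w), $$

instead of $A^{k_1k_2}(w)$, $B^{k_1k_2}(w)$, and $C^{k_1}(w)$. 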

3.2 Main Theorem 

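The following is the main result of this paper. 

Theorem 1. Assume the regularity conditions (1), (2), and (5). Let $M(\phi, w)$ and 
$\mathcal{M}(\phi, w)$ be the empirical and average mathematical relations between $\varphi(w)$ and $\varphi_0(w)$. 
Then 

$$ \mathrm{CV}(\varphi) = \mathrm{CV}(\varphi_0) + \frac{M(\phi,\hat{w})}{n^2} + O_p\!\left(\frac{1}{n^3}\right), \tag{39} $$
$$ \mathbb{E}[\mathrm{CV}(\varphi)] = \mathbb{E}[\mathrm{CV}(\varphi_0)] + \frac{\mathcal{M}(\phi,w_0)}{n^2} + O\!\left(\frac{1}{n^3}\right), \tag{40} $$
$$ \mathrm{WAIC}(\varphi) = \mathrm{WAIC}(\varphi_0) + \frac{M(\phi,\hat{w})}{n^2} + O_p\!\left(\frac{1}{n^3}\right), \tag{41} $$
$$ \mathbb{E}[\mathrm{WAIC}(\varphi)] = \mathbb{E}[\mathrm{WAIC}(\varphi_0)] + \frac{\mathcal{M}(\phi,w_0)}{n^2} + O\!\left(\frac{1}{n^3}\right), \tag{42} $$
$$ \mathrm{CV}(\varphi) = \mathrm{WAIC}(\varphi) + O_p\!\left(\frac{1}{n^3}\right), \tag{43} $$

and 

$$ M(\phi,\hat{w}) = \mathcal{M}(\phi,w_0) + O_p\!\left(\frac{1}{n^{1/2}}\right), \tag{44} $$
$$ M(\phi,\mathbb{E}_\varphi[w]) = M(\phi,\hat{w}) + O_p\!\left(\frac{1}{n}\right), \tag{45} $$
$$ \mathbb{E}[M(\phi,\hat{w})] = \mathcal{M}(\phi,w_0) + O\!\left(\frac{1}{n}\right). \tag{46} $$

On the other hand, 

$$ G(\varphi) = G(\varphi_0) + \frac{1}{n}\big(\hat{w}^{k_1} - (w_0)^{k_1}\big)(\log\phi)_{k_1}(\hat{w}) + O_p\!\left(\frac{1}{n^2}\right) = G(\varphi_0) + O_p\!\left(\frac{1}{n^{3/2}}\right), \tag{47} $$
$$ \mathbb{E}[G(\varphi)] = \mathbb{E}[G(\varphi_0)] + \frac{\mathcal{M}(\phi,w_0)}{n^2} + O\!\left(\frac{1}{n^3}\right). \tag{48} $$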

From Theorem 1, the following five mathematical facts are derived. 

(1) Assume that a prior $\varphi(w)$ has a hyperparameter. Let $h(f)$ be the hyperparameter that 
minimizes $f(\varphi)$. By eq. (39) and eq. (41), $h(\mathrm{CV})$ and $h(\mathrm{WAIC})$ can be directly found by 
minimizing the empirical mathematical relation $M(\phi,\hat{w})$ asymptotically. By eq. (44) and 
eq. (48), $h(\mathrm{CV})$ and $h(\mathrm{WAIC})$ are asymptotically equal to $h(\mathbb{E}[G])$. Remark that $\mathrm{CV}(\varphi)$, 
$\mathrm{WAIC}(\varphi)$, and $G(\varphi)$ may be unbounded or may not have a minimum value as a function 
of a hyperparameter if the set of all hyperparameters is not compact. By using $M(\phi,\hat{w})$, 
we can examine whether they have a minimum value or not. The divergence phenomenon 
of $\mathrm{CV}(\varphi)$ and $\mathrm{WAIC}(\varphi)$ as functions on a noncompact set of hyperparameters is discussed 
in Section 7. 

(2) The variance of $h(\mathrm{CV})$ is asymptotically equal to that of $h(\mathrm{WAIC})$, however, they may 
be different when the number of training samples is finite. 

(3) In calculation of the mathematical relation $M(\phi,\hat{w})$, the MAP estimator $\hat{w}$ can be 
replaced by the posterior average parameter $\mathbb{E}_\varphi[w]$ asymptotically. 

(4) By eq. (47), the variance of the random generalization loss $G(\varphi) - G(\varphi_0)$ is larger 
than those of $\mathrm{CV}(\varphi) - \mathrm{CV}(\varphi_0)$ and $\mathrm{WAIC}(\varphi) - \mathrm{WAIC}(\varphi_0)$. Neither $h(\mathrm{CV})$ nor $h(\mathrm{WAIC})$ 
minimizes the random generalization loss $G(\varphi)$ in general. 

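(5) It was proved in (Watanabe, 2010) that $\mathbb{E}[G(\varphi_0)] = d/(2n) + o(1/n)$, where $d$ is the 
dimension of the parameter set. Assume that there exist finite sets of real values $\{d_k\}$ and 
$\{\gamma_k\}$, where $\gamma_k > 1$, such that 

$$ \mathbb{E}[G(\varphi_0)] = \frac{d}{2n} + \sum_{k}\frac{d_k}{n^{\gamma_k}} + o\!\left(\frac{1}{n^2}\right). $$

Since $\mathbb{E}[\mathrm{CV}(\varphi_0)]$ of $X^n$ is equal to $\mathbb{E}[G(\varphi_0)]$ of $X^{n-1}$ and 

$$ \frac{1}{n-1} = \frac{1}{n} + \frac{1}{n^2} + o\!\left(\frac{1}{n^2}\right), $$

it immediately follows from Theorem 1 that 

$$ \mathbb{E}[G(\varphi)] = \mathbb{E}[G(\varphi_0)] + \frac{\mathcal{M}(\phi,w_0)}{n^2} + o\!\left(\frac{1}{n^2}\right), \tag{49} $$
$$ \mathbb{E}[\mathrm{CV}(\varphi)] = \mathbb{E}[G(\varphi_0)] + \frac{d/2 + \mathcal{M}(\phi,w_0)}{n^2} + o\!\left(\frac{1}{n^2}\right), \tag{50} $$
$$ \mathbb{E}[\mathrm{WAIC}(\varphi)] = \mathbb{E}[G(\varphi_0)] + \frac{d/2 + \mathcal{M}(\phi,w_0)}{n^2} + o\!\left(\frac{1}{n^2}\right). \tag{51} $$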

Theorem 2. Assume the regularity conditions (1), (2), and (5). If there exists a 
parameter $w_0$ such that $q(x) = p(x|w_0)$, then 

$$ \langle M\rangle(\phi,\hat{w}) = M(\phi,\hat{w}) + O_p\!\left(\frac{1}{\sqrt{n}}\right), \tag{52} $$
$$ \langle M\rangle(\phi,\hat{w}) = \mathcal{M}(\phi,w_0) + O_p\!\left(\frac{1}{\sqrt{n}}\right). \tag{53} $$

By Theorem 2, if the true distribution is realizable by a statistical model or a learning 
machine, then the empirical mathematical relation can be replaced by its self-average. The 
variance of the self-average mathematical relation is often smaller than that of the original 
one, hence the variance of the hyperparameter estimated by using the self-average is made smaller. 

Based on Theorems 1 and 2, we define new information criteria for hyperparameter 
optimization, the widely applicable information criterion for a regular case and for a regular 
case using self-average, 

$$ \mathrm{WAICR} = \frac{M(\phi,\hat{w})}{n^2}, \tag{54} $$
$$ \mathrm{WAICRS} = \frac{\langle M\rangle(\phi,\hat{w})}{n^2}, \tag{55} $$

where $\hat{w}$ can be replaced by $\mathbb{E}_\varphi[w]$. The optimal hyperparameter for predictive prior 
design can be directly found by minimization of these criteria if they have minimum 
points. 


4 Examples 

4.1 Normal Distribution 

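A simple but nontrivial example is a normal distribution whose mean and variance 
are $(m, 1/s)$, 

$$ p(x|m,s) = \sqrt{\frac{s}{2\pi}}\exp\Big(-\frac{s}{2}(x-m)^2\Big). \tag{56} $$

For a prior distribution, we study 

$$ \varphi(m,s|\lambda,\mu,\epsilon) = s^{\mu}\exp\Big(-\frac{s}{2}(\lambda m^2 + \epsilon)\Big), \tag{57} $$

where $(\lambda, \mu, \epsilon)$ is a set of hyperparameters. Note that the prior is improper in general. If 
$\lambda > 0$, $\mu > -1/2$ and $\epsilon > 0$, the prior can be made proper by 

$$ \Phi(m,s|\lambda,\mu,\epsilon) = \frac{1}{C}\,\varphi(m,s|\lambda,\mu,\epsilon), $$

where 

$$ C = \sqrt{\frac{2\pi}{\lambda}}\,(\epsilon/2)^{-\mu-1/2}\,\Gamma(\mu + 1/2). $$

We use a fixed prior $\varphi_0(m,s) \equiv 1$, then the empirical log loss function is given by 

$$ L(m,s) = \frac{1}{2}\log(2\pi) - \frac{1}{2}\log s + \frac{s}{2n}\sum_{i=1}^{n}(X_i - m)^2. \tag{58} $$

Let $M_j = (1/n)\sum_{i=1}^{n}(X_i - \hat{m})^j$ $(j = 2, 3, 4)$. The MAP estimator is equal to the MLE 
$\hat{w} = (\hat{m}, \hat{s})$, where $\hat{m} = (1/n)\sum_{i=1}^{n}X_i$ and $\hat{s} = 1/M_2$, resulting that 

$$ A^{k_1k_2}(\hat{w}) = \begin{pmatrix} 1/(2\hat{s}) & 0 \\ 0 & \hat{s}^2 \end{pmatrix}, \tag{59} $$
$$ B^{k_1k_2}(\hat{w}) = \begin{pmatrix} 1/\hat{s} & -\hat{s}^2M_3/2 \\ -\hat{s}^2M_3/2 & (\hat{s}^2 + \hat{s}^4M_4)/2 \end{pmatrix}, \tag{60} $$
$$ C^{k_1}(\hat{w}) = (0,\ \hat{s} + \hat{s}^3M_4). \tag{61} $$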

(60) 
(61) 
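Also the self-average mathematical relation is given by 

$$ \langle A^{k_1k_2}\rangle(w) = \begin{pmatrix} 1/(2s) & 0 \\ 0 & s^2 \end{pmatrix}, \tag{62} $$
$$ \langle B^{k_1k_2}\rangle(w) = 2\langle A^{k_1k_2}\rangle(w), \tag{63} $$
$$ \langle C^{k_1}\rangle(w) = (0,\ 4s). \tag{64} $$

The prior ratio function is $\phi(w) = \varphi(w)$, hence the derivatives of the log prior ratio are 

$$ (\log\phi)_1(w) = -\lambda s m, $$
$$ (\log\phi)_2(w) = -\frac{\lambda m^2}{2} + \frac{\mu}{s} - \frac{\epsilon}{2}, $$
$$ (\log\phi)_{11}(w) = -\lambda s, $$
$$ (\log\phi)_{12}(w) = -\lambda m, $$
$$ (\log\phi)_{22}(w) = -\frac{\mu}{s^2}. $$

Therefore, the empirical and self-average mathematical relations are respectively 

$$ M(\phi,\hat{m},\hat{s}) = \frac{1}{2}\lambda^2\hat{s}\hat{m}^2 + \big(-\lambda\hat{s}\hat{m}^2/2 + \mu - \epsilon\hat{s}/2\big)^2 + \big(-\lambda\hat{s}\hat{m}^2/2 + \mu/2 - \epsilon\hat{s}/2\big)\big(1 + \hat{s}^2M_4\big) - \lambda + \lambda\hat{m}\hat{s}^2M_3, $$
$$ \langle M\rangle(\phi,\hat{m},\hat{s}) = \frac{1}{2}\lambda^2\hat{s}\hat{m}^2 + \big(-\lambda\hat{s}\hat{m}^2/2 + \mu - \epsilon\hat{s}/2\big)^2 + 4\big(-\lambda\hat{s}\hat{m}^2/2 + \mu/2 - \epsilon\hat{s}/2\big) - \lambda. $$

When $\lambda = \epsilon = 0$, $M(\phi,\hat{m},\hat{s})$ is minimized at $\mu = -(1 + \hat{s}^2M_4)/4$, whereas $\langle M\rangle(\phi,\hat{m},\hat{s})$ 
is minimized at $\mu = -1$. 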
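To illustrate how the optimal hyperparameter can be read off without fitting a posterior for every candidate, the sketch below minimizes the two closed-form relations of this subsection over a grid of $\mu$ with $\lambda = \epsilon = 0$. The grid range, the sample, and the function names `M_emp` and `M_self` are illustrative assumptions of ours.

```python
import numpy as np


def M_emp(mu, lam, eps, m, s, M3, M4):
    """Empirical relation M(phi, m, s) for the normal model (closed form)."""
    u = -lam * s * m ** 2 / 2 - eps * s / 2
    return (0.5 * lam ** 2 * s * m ** 2 + (u + mu) ** 2
            + (u + mu / 2) * (1 + s ** 2 * M4) - lam + lam * m * s ** 2 * M3)


def M_self(mu, lam, eps, m, s):
    """Self-average relation: third moment 0 and s^2 * M4 = 3 under the model."""
    return M_emp(mu, lam, eps, m, s, 0.0, 3.0 / s ** 2)


rng = np.random.default_rng(3)
x = rng.normal(1.0, 1.0, size=25)
m = x.mean()
M2, M3, M4 = (np.mean((x - m) ** j) for j in (2, 3, 4))
s = 1.0 / M2

mu_grid = np.linspace(-5.0, 5.0, 10001)        # candidate hyperparameters, step 0.001
mu_waicr = mu_grid[np.argmin(M_emp(mu_grid, 0.0, 0.0, m, s, M3, M4))]
mu_waicrs = mu_grid[np.argmin(M_self(mu_grid, 0.0, 0.0, m, s))]

print(mu_waicr, -(1 + s ** 2 * M4) / 4)        # grid minimizer vs. stated minimizer
print(mu_waicrs)                               # close to -1 for every sample
```

The empirical minimizer moves with the sample kurtosis $\hat{s}^2M_4$, while the self-average minimizer does not depend on the data when $\lambda = \epsilon = 0$, which is consistent with the much smaller spread reported for WAICRS in the experiments.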

In this model, we can derive the exact forms of CV, WAIC, DIC, and the free energy, 
hence we can compare the optimal hyperparameters for these criteria. Let 

$$ Z_n(X,\alpha) = \int p(X|w)^{\alpha}\prod_{i=1}^{n}p(X_i|w)\,\varphi(w)\,dw. $$

Then 

$$ Z_n(X,\alpha) = (2\pi)^{-(n+\alpha-1)/2}\,a(\alpha)^{-1/2}\exp\big(-c(\alpha)\log d(\alpha)\big)\,\Gamma(c(\alpha)), $$

where $\Gamma(\ )$ is the gamma function and 

$$ a(\alpha) = \alpha + \lambda + n, $$
$$ b(\alpha) = \alpha X + \sum_{i=1}^{n}X_i, $$
$$ c(\alpha) = \mu + (\alpha + n + 1)/2, $$
$$ d(\alpha) = \frac{1}{2}\Big(\alpha X^2 + \sum_{i=1}^{n}X_i^2 - \frac{b(\alpha)^2}{a(\alpha)} + \epsilon\Big). $$

            μ      ΔC         ΔW         WAICR      WAICRS     ΔD         ΔG
Average    -1    -0.00194   -0.00175   -0.00147   -0.00165    0.00332   -0.00156
STD        -1     0.00101    0.00080    0.00062    0.00001    0.00001    0.01292
Average     1     0.00506    0.00489    0.00450    0.00467    0.00006    0.00445
STD         1     0.00095    0.00076    0.00059    0.00004    0.00002    0.01250

Table 1: Averages and Standard Errors of Criteria 

All criteria can be calculated by using $Z_n(X,\alpha)$ by their definitions, 

$$ \mathrm{CV}(\varphi) = -\frac{1}{n}\sum_{i=1}^{n}\{\log Z_n(0,0) - \log Z_n(X_i,-1)\}, $$
$$ \mathrm{WAIC}(\varphi) = -\frac{1}{n}\sum_{i=1}^{n}\Big\{\log Z_n(X_i,1) - \log Z_n(0,0) - \frac{\partial^2}{\partial\alpha^2}(\log Z_n(X_i,0))\Big\}, $$
$$ \mathrm{DIC}(\varphi) = -\frac{1}{n}\sum_{i=1}^{n}\Big\{2\frac{\partial}{\partial\alpha}(\log Z_n(X_i,0)) - \log p(X_i|\bar{m},\bar{s})\Big\}, $$
$$ F_{free}(\varphi) = -\log Z_n(0,0) + \log Z_0(0,0), $$

where $\bar{m} = b(0)/a(0)$ and $\bar{s} = (2\mu + n + 1)/(\sum_{i=1}^{n}X_i^2 - b(0)^2/a(0) + \epsilon)$. 

A numerical experiment was conducted. A true distribution $q(x)$ was set as $\mathcal{N}(1, 1^2)$. 
We study a case $n = 25$. Ten thousand independent training sets were collected. A 
statistical model and a prior were defined by eq. (56) and eq. (57) respectively. The fixed 
prior was $\varphi_0(w) = 1$. We set $\lambda = \epsilon = 0.01$, and studied the optimization problem of the 
hyperparameter $\mu$. Firstly, we compared averages and standard deviations of criteria. In 
Table 1, averages and standard deviations of 

$$ \Delta C = \mathrm{CV}(\varphi) - \mathrm{CV}(\varphi_0), $$
$$ \Delta W = \mathrm{WAIC}(\varphi) - \mathrm{WAIC}(\varphi_0), $$
$$ \mathrm{WAICR} = M(\phi,\hat{w})/n^2, $$
$$ \mathrm{WAICRS} = \langle M\rangle(\phi,\hat{w})/n^2, $$
$$ \Delta D = \mathrm{DIC}(\varphi) - \mathrm{DIC}(\varphi_0), $$
$$ \Delta G = G(\varphi) - G(\varphi_0), $$

are shown for the two cases $\mu = \pm 1$. In this experiment, averages of $\Delta C$, $\Delta W$, WAICR, 
and WAICRS were almost equal to that of $\Delta G$, however that of $\Delta D$ was not. The standard 
deviations were 

$$ \sigma(\Delta G) \gg \sigma(\Delta C) > \sigma(\Delta W) > \sigma(\mathrm{WAICR}) > \sigma(\mathrm{WAICRS}) \cong \sigma(\Delta D). $$

The standard deviation of $\Delta G$ was the largest, which is consistent with Theorem 1. Note that 
CV had a larger variance than WAIC. WAICRS gave the most precise result. 
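The closed form of $Z_n(X,\alpha)$ makes it possible to check eq. (39) numerically on a single training set. The sketch below is an illustration under our own choices (sample size $n = 400$, one seed, the helper names `log_Z`, `cv_exact`, `M_emp`), not a reproduction of the reported experiment: it computes the exact $\mathrm{CV}(\varphi) - \mathrm{CV}(\varphi_0)$ and compares $n^2$ times this difference with $M(\phi,\hat{w})$.

```python
import math
import numpy as np


def log_Z(x, X, alpha, lam, mu, eps):
    """log Z_n(X, alpha) for the normal model with prior s^mu exp(-(s/2)(lam m^2 + eps))."""
    n = len(x)
    a = alpha + lam + n
    b = alpha * X + x.sum()
    c = mu + (alpha + n + 1) / 2
    d = 0.5 * (alpha * X ** 2 + np.sum(x ** 2) - b ** 2 / a + eps)
    return (-(n + alpha - 1) / 2 * math.log(2 * math.pi) - 0.5 * math.log(a)
            + math.lgamma(c) - c * math.log(d))


def cv_exact(x, lam, mu, eps):
    """Exact Bayesian leave-one-out CV via Z_n(0,0) and Z_n(X_i,-1)."""
    z0 = log_Z(x, 0.0, 0.0, lam, mu, eps)
    return -np.mean([z0 - log_Z(x, xi, -1.0, lam, mu, eps) for xi in x])


def M_emp(mu, lam, eps, m, s, M3, M4):
    """Closed-form empirical relation M(phi, m, s) of Section 4.1."""
    u = -lam * s * m ** 2 / 2 - eps * s / 2
    return (0.5 * lam ** 2 * s * m ** 2 + (u + mu) ** 2
            + (u + mu / 2) * (1 + s ** 2 * M4) - lam + lam * m * s ** 2 * M3)


rng = np.random.default_rng(4)
n = 400                                         # larger than the paper's n = 25 (our choice)
x = rng.normal(1.0, 1.0, size=n)
lam, mu, eps = 0.01, -1.0, 0.01

# The fixed prior phi_0 = 1 corresponds to lam = mu = eps = 0 in the same family.
delta_cv = cv_exact(x, lam, mu, eps) - cv_exact(x, 0.0, 0.0, 0.0)

m = x.mean()
M2, M3, M4 = (np.mean((x - m) ** j) for j in (2, 3, 4))
relation = M_emp(mu, lam, eps, m, 1.0 / M2, M3, M4)

print(n ** 2 * delta_cv, relation)
```

For moderately large $n$ the two printed numbers are expected to agree up to a relative error of order $1/n$, in line with the $O_p(1/n^3)$ remainder of eq. (39).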

Secondly, we compared the distributions of the hyperparameters chosen by the criteria. 
One hundred candidate hyperparameters in the interval $(-2.5, 2.5]$ were compared and 
the optimal hyperparameter for each criterion was chosen by minimization. Remark that 
the interval for the free energy was set as $(-0.5, 2.5]$ because the prior is proper if and 
only if $\mu > -0.5$. In Table 2, averages (A), standard deviations (STD), and A $\pm$ 2STD of 
optimal hyperparameters are shown. 

            h(CV)     h(WAIC)   h(WAICR)  h(WAICRS)  h(DIC)    h(F)
Average    -0.9863   -0.9416   -0.9329   -0.9993     0.4512   -0.2977
STD         0.2297    0.19231   0.1885    0.0059     0.0077    0.0106
A-2STD     -1.4456   -1.3262   -1.3100   -1.0112     0.4358   -0.3188
A+2STD     -0.5269   -0.5569   -0.5559   -0.9874     0.4667   -0.2766

Table 2: Chosen Hyperparameters in Normal distribution 

In this case, the optimal hyperparameter for the minimum generalization loss is almost 
equal to $(-1)$, whose prior is improper. By CV, WAIC, WAICR, and WAICRS, a hyperparameter 
close to the optimal one was chosen, whereas by DIC or the free energy, it was not. The 
standard deviations of chosen hyperparameters were 

$$ \sigma(h(\mathrm{CV})) > \sigma(h(\mathrm{WAIC})) > \sigma(h(\mathrm{WAICR})) > \sigma(h(F)) > \sigma(h(\mathrm{DIC})) > \sigma(h(\mathrm{WAICRS})). $$

In this experiment, neither the marginal likelihood nor DIC was appropriate for the 
predictive prior design. 

4.2 Linear Regression 

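Let us study a linear regression problem. Let a statistical model of $y \in \mathbb{R}$, $x \in \mathbb{R}^d$, 
$w \in \mathbb{R}^d$ and a prior with $\lambda \in \mathbb{R}$ be 

$$ p(y|x,w) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(y - w\cdot x)^2\Big), \tag{65} $$
$$ \varphi(w|\lambda) = \exp\Big(-\frac{\lambda}{2}\|w\|^2\Big), \tag{66} $$

where $\sigma > 0$ is a constant. The basic prior is set as $\varphi_0(w) = 1$. Hence the prior ratio 
function is $\phi(w) = \varphi(w|\lambda)$ and $\hat{w}$ is equal to the MLE. The function $L(w)$ without a constant 
term is 

$$ L(w) = \frac{1}{2\sigma^2 n}\sum_{i=1}^{n}(y_i - w\cdot x_i)^2. \tag{67} $$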
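It is immediately derived that 

$$ \hat{w}_{k_1} = \Big(\sum_{i=1}^{n}x_{ik_1}x_{ik_2}\Big)^{-1}\Big(\sum_{i=1}^{n}y_i x_{ik_2}\Big), \tag{68} $$
$$ L_{k_1k_2}(\hat{w}) = \frac{1}{\sigma^2 n}\sum_{i=1}^{n}x_{ik_1}x_{ik_2}, \tag{69} $$
$$ L_{k_1k_2k_3}(\hat{w}) = 0, \tag{70} $$
$$ F_{k_1,k_2}(\hat{w}) = \frac{1}{\sigma^4 n}\sum_{i=1}^{n}x_{ik_1}x_{ik_2}(y_i - \hat{w}\cdot x_i)^2, \tag{71} $$
$$ F_{k_1k_2,k_3}(\hat{w}) = -\frac{1}{\sigma^4 n}\sum_{i=1}^{n}x_{ik_1}x_{ik_2}x_{ik_3}(y_i - \hat{w}\cdot x_i), \tag{72} $$
$$ (\log\phi)_{k_1}(\hat{w}) = -\lambda\hat{w}_{k_1}, \tag{73} $$
$$ (\log\phi)_{k_1k_2}(\hat{w}) = -\lambda\delta_{k_1k_2}, \tag{74} $$

and $J^{k_1k_2}(\hat{w}) = (L_{k_1k_2})^{-1}(\hat{w})$. Hence 

$$ A^{k_1k_2}(\hat{w}) = \frac{1}{2}J^{k_1k_2}(\hat{w}), \tag{75} $$
$$ B^{k_1k_2}(\hat{w}) = \frac{1}{2}\Big(J^{k_1k_2}(\hat{w}) + J^{k_1k_3}(\hat{w})J^{k_2k_4}(\hat{w})F_{k_3,k_4}(\hat{w})\Big), \tag{76} $$
$$ C^{k_1}(\hat{w}) = J^{k_1k_2}(\hat{w})J^{k_3k_4}(\hat{w})F_{k_2k_4,k_3}(\hat{w}), \tag{77} $$

resulting that 

$$ M(\phi,\hat{w}) = \frac{\lambda^2}{2}\big(J^{k_1k_2}(\hat{w})\hat{w}_{k_1}\hat{w}_{k_2}\big) - \lambda\big(\mathrm{tr}(B(\hat{w})) + C^{k_1}(\hat{w})\hat{w}_{k_1}\big), \tag{78} $$
$$ \langle M\rangle(\phi,\hat{w}) = \frac{\lambda^2}{2}\big(J^{k_1k_2}(\hat{w})\hat{w}_{k_1}\hat{w}_{k_2}\big) - \lambda\,\mathrm{tr}(J(\hat{w})), \tag{79} $$

where we used $\langle B^{k_1k_2}\rangle(w) = J^{k_1k_2}(w)$ and $\langle C^{k_1}\rangle(w) = 0$. In this model the optimal 
hyperparameter for WAICRS is directly given by 

$$ \lambda = \frac{\mathrm{tr}(J(\hat{w}))}{J^{k_1k_2}(\hat{w})\hat{w}_{k_1}\hat{w}_{k_2}}. \tag{80} $$

The exact CV, WAIC, DIC, and the free energy are also calculated. Let 

$$ Z_n(X,Y,\alpha) = \int p(Y|X,w)^{\alpha}\prod_{i=1}^{n}p(Y_i|X_i,w)\,\varphi(w)\,dw. $$

Then 

$$ Z_n(X,Y,\alpha) = \frac{(2\pi\sigma^2)^{d/2}\exp\Big(\frac{1}{2\sigma^2}\big(b(\alpha)^{T}A(\alpha)^{-1}b(\alpha) - (\alpha Y^2 + \sum_{i=1}^{n}Y_i^2)\big)\Big)}{(2\pi\sigma^2)^{(n+\alpha)/2}\det(A(\alpha))^{1/2}}, $$

where 

$$ A(\alpha) = \alpha XX^{T} + \sigma^2\lambda I + \sum_{i=1}^{n}X_iX_i^{T}, $$
$$ b(\alpha) = \alpha YX + \sum_{i=1}^{n}Y_iX_i. $$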

            h(CV)    h(WAIC)  h(WAICR)  h(WAICRS)  h(DIC)   h(F)
Average     5.0064   5.0017   4.9320    5.0253     5.0248   1.0000
STD         1.9358   1.9297   1.8808    0.2960     0.2961   0.0000

Table 3: Chosen Hyperparameters in linear regression 


All criteria can be calculated by using $Z_n(X, Y, \alpha)$ by their definitions, 

$$ \mathrm{CV}(\varphi) = -\frac{1}{n}\sum_{i=1}^{n}\{\log Z_n(0,0,0) - \log Z_n(X_i,Y_i,-1)\}, $$
$$ \mathrm{WAIC}(\varphi) = -\frac{1}{n}\sum_{i=1}^{n}\Big\{\log Z_n(X_i,Y_i,1) - \log Z_n(0,0,0) - \frac{\partial^2}{\partial\alpha^2}(\log Z_n(X_i,Y_i,0))\Big\}, $$
$$ \mathrm{DIC}(\varphi) = -\frac{1}{n}\sum_{i=1}^{n}\Big\{2\frac{\partial}{\partial\alpha}(\log Z_n(X_i,Y_i,0)) - \log p(Y_i|X_i,\bar{m})\Big\}, $$
$$ F_{free}(\varphi) = -\log Z_n(0,0,0) + \log Z_0(0,0,0), $$

where $\bar{m} = A(0)^{-1}b(0)$. 

A numerical experiment was conducted for the case $q(x,y) = q(x)p(y|x,w_0)$. Here 
$q(x)$ was the normal distribution $\mathcal{N}(a_0, I_5)$, where $a_0 = (1,1,1,1,1)$, $I_5$ is the $d = 5$ 
dimensional identity matrix, and $w_0 = (1,1,1,1,1)$. A constant $\sigma = 0.1$ was set. The 
candidate hyperparameters for $\lambda$ were taken from the interval $(0, 10)$. Distributions of 
chosen hyperparameters are shown in Table 3. The average hyperparameters chosen by 
CV, WAIC, WAICR, WAICRS, and DIC were almost equal to each other. The variances 
by WAICRS and DIC were smaller than those of the other methods. The optimal hyperparameter 
by the marginal likelihood was different from those of the other methods. In this case, the posterior 
distribution is rigorously equal to the normal distribution and the true distribution is 
realizable by a statistical model, hence DIC can be applied, whose value was almost equal 
to WAICRS. Note that, in this model, if the true parameter is $w_0 = 0$, then the optimal 
hyperparameter $\lambda$ diverges as $n \to \infty$. This phenomenon is caused by the fact that $w_0 = 0$ 
is contained in the divergent parameter, which is discussed in Section 7.2. 
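The optimal hyperparameter of eq. (80) needs only the least squares estimate and the inverse Hessian, so it can be sketched in a few lines. The code below is a hedged illustration, not the authors' experiment: the sample size, the seed, and the function name `optimal_lambda_waicrs` are our assumptions, while $d = 5$, $\sigma = 0.1$, $a_0$ and $w_0$ follow the setting described above.

```python
import numpy as np


def optimal_lambda_waicrs(X, y, sigma):
    """Hyperparameter of eq. (80): tr(J) / (J^{k1k2} w_k1 w_k2) at the MLE."""
    n = X.shape[0]
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)       # eq. (68), least squares estimate
    hessian = X.T @ X / (sigma ** 2 * n)            # eq. (69), L_{k1k2}
    J = np.linalg.inv(hessian)
    return np.trace(J) / (w_hat @ J @ w_hat)


rng = np.random.default_rng(5)
n, d, sigma = 20000, 5, 0.1                         # n is our choice for a stable demo
a0 = np.ones(d)
w0 = np.ones(d)
X = rng.normal(size=(n, d)) + a0                    # inputs drawn from N(a0, I_5)
y = X @ w0 + sigma * rng.normal(size=n)             # outputs from p(y|x, w0)

lam_star = optimal_lambda_waicrs(X, y, sigma)
print(lam_star)
```

In the population limit of this setting $J \propto (I + a_0a_0^{T})^{-1}$, so $\mathrm{tr}(J)/(w_0^{T}Jw_0) = (25/6)/(5/6) = 5$, which agrees with the averages near 5 in Table 3.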


5 Basic Lemmas 


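The main purpose of this paper is to prove Theorems 1 and 2. In this section we prepare 
several lemmas which are used in the proof of the main theorems. 

For an arbitrary function $f(w)$, we define the expectation values by 

$$ \mathbb{E}^{(\pm j)}_{\varphi}[f(w)] = \frac{\displaystyle\int f(w)\,\varphi(w)\,p(X_j|w)^{\pm 1}\prod_{i=1}^{n}p(X_i|w)\,dw}{\displaystyle\int \varphi(w)\,p(X_j|w)^{\pm 1}\prod_{i=1}^{n}p(X_i|w)\,dw}. $$

Then the predictive distribution of $x$ using training samples $X^n$ leaving $X_j$ out is 

$$ p(x|X^n\setminus X_j) = \mathbb{E}^{(-j)}_{\varphi}[p(x|w)]. $$

Thus its log loss for the test sample $X_j$ is 

$$ -\log p(X_j|X^n\setminus X_j) = -\log\mathbb{E}^{(-j)}_{\varphi}[p(X_j|w)]. $$

The log loss of the leave-one-out cross validation is then given by 

$$ \mathrm{CV}(\varphi) = -\frac{1}{n}\sum_{j=1}^{n}\log\mathbb{E}^{(-j)}_{\varphi}[p(X_j|w)]. \tag{81} $$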

Lemma 1. Let 4>{w) = ^p{w)/ipo{w). The cross validation and the generalization 
satisfy the following equations. 


CV((^) = CV((/?o) -^{logE^^^)[(/>(?u)] - logE<^o[(/>(u;)]}, 


j=i 


G{ip) = G{(po) - Ex„+i logE5,+("+^))[(/)(rc)] - logE^^[(p{w)] 

(Proof of Lemma 1) By the definitions E<^[ ], E^ ], it follows that 

« n 

/ t { w ) n p{Xi\w)dw 


E^-^^\piXj\w)] = 


2=1 


ip{w) n p{Xi\w)dw 


^ n ^ n 

ip{w)Y\p{Xi\w)dw (po{w) \w)dw 

_i=l_ _ _i=l_ 

n ^ n 

To{w)'^p{Xi\w)dw / ipQ{w)'^p{Xi\w)dw 


2=1 




/ n 

(po{w) n p{Xi\w)dw 

i¥=j 


/ n 

(p{w) n p{Xi\w)dw 

= E(,-^')[p(X,»]/E(,7) [(/)]. 

By the definition CV((/9), ea. dSip . the first half of Lemma 1 is obtained. For the 


( 81 ) 

error 

(82) 

(83) 


latter 
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half, 


E^\p{Xn+l\w)] 


/ n 

Lp{w)p{Xn+l\w) n p{Xi\w)dw 
2=1 


/ ^{w) n p{Xi\w)dw 

i=l 

n+1 „ n+1 

(p{w) n p{Xi\w)dw / ipQ{w)Y\_p{^i\w)dw 


2=1 


2 = 1 


n+1 p ^ 

‘Po{w)W_p{Xi\w)dw / n p{Xi\w)dw 

i=i *=i 


/ n 

(po{w) n p{Xi\w)dw 

i=i 

/ n 

ip{w) n p{Xi\w)dw 

i=i 


= E+f+i)[(/)] E^MXn+i\w)] / 

By using the definition of the generalization error, eq.([S]), the latter half of Lemma 1 is 
obtained. (Q.E.D.) 
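The first identity of Lemma 1 can be checked numerically. The following is a minimal sketch, not part of the original paper: it assumes a toy model $p(x|w) = N(x; w, 1)$, two normal priors chosen only for illustration, and a uniform grid in place of the integrals. On a common grid the identity holds exactly up to rounding, because it is a purely algebraic rearrangement of ratios of sums.

```python
import numpy as np

def lse(a, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def log_norm_pdf(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=20)              # toy sample X^n (assumption)
w = np.linspace(-6.0, 6.0, 2401)               # parameter grid (assumption)
loglik = log_norm_pdf(x[:, None], w[None, :], 1.0)   # shape (n, grid)
log_phi0 = log_norm_pdf(w, 0.0, 1.0)           # reference prior phi_0
log_phi = log_norm_pdf(w, 0.0, 0.25)           # candidate prior phi
log_ratio = log_phi - log_phi0                 # log of phi / phi_0

def loo_log_expect(log_f, log_prior):
    """log E^{(-j)}_prior[f] for every j, with X_j left out of the posterior."""
    loo = log_prior + loglik.sum(axis=0) - loglik    # (n, grid)
    return lse(loo + log_f, axis=1) - lse(loo, axis=1)

def cv(log_prior):
    return -np.mean(loo_log_expect(loglik, log_prior))

full = log_phi0 + loglik.sum(axis=0)
log_E_ratio = lse(full + log_ratio) - lse(full)      # log E_{phi_0}[phi/phi_0]

lhs = cv(log_phi)
rhs = cv(log_phi0) + np.mean(loo_log_expect(log_ratio, log_phi0) - log_E_ratio)
print(lhs, rhs)
```

The practical point is that $\mathrm{CV}(\varphi)$ for many candidate priors can be obtained from posterior averages under the single reference prior $\varphi_0$.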

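Definition. The log loss function for $X^n \setminus X_j$ is defined by

$$L(w,-j) = -\frac{1}{n}\sum_{i\neq j}^{n}\log p(X_i|w) - \frac{1}{n}\log\varphi_0(w).$$

The MAP estimator for $X^n \setminus X_j$ is denoted by

$$\hat{w}_j = \arg\min_{w} L(w,-j).$$

Lemma 2. Let $f(w)$ be a function $Q(X^n,w)$ which satisfies the regularity conditions (1),
(2), (5). Then there exist functions $R_1(f,w)$ and $R_2(f,w)$ which satisfy

$$\mathbb{E}_{\varphi_0}[f(w)] = f(\hat{w}) + \frac{R_1(f,\hat{w})}{n} + \frac{R_2(f,\hat{w})}{n^2} + O_p\Big(\frac{1}{n^3}\Big), \tag{84}$$

$$\mathbb{E}_{\varphi_0}^{(-j)}[f(w)] = f(\hat{w}_j) + \frac{R_1(f,\hat{w}_j)}{n-1} + \frac{R_2(f,\hat{w}_j)}{(n-1)^2} + O_p\Big(\frac{1}{n^3}\Big), \tag{85}$$

where $R_1(f,w)$ is given by

$$R_1(f,w) = \frac{1}{2}f_{k_1k_2}(w)J^{k_1k_2}(w) - \frac{1}{2}f_{k_1}(w)V^{k_1}(w), \tag{86}$$

and $V^{k_1}(w) = J^{k_1k_2}(w)J^{k_3k_4}(w)L_{k_2k_3k_4}(w)$.

We do not need the concrete form of $R_2(f,w)$ in the proof of the main theorem.
However, it is given in the proof of Lemma 2, eq.(108).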
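The first-order term of eq.(84) can be illustrated in one dimension. The sketch below is an illustration under stated assumptions, not the paper's computation: it uses a hypothetical loss $L(w) = e^{w} - w - 1$ (minimum at $\hat{w} = 0$, $L''(0) = 1$, $L'''(0) = 1$) and $f(w) = w$, so that $J = 1$, $V = 1$ and $R_1(f,\hat{w}) = -1/2$. The posterior average is evaluated on a grid and compared with $f(\hat{w}) + R_1/n$.

```python
import numpy as np

def posterior_mean(n, grid):
    """E[w] under the density proportional to exp(-n L(w)) on a uniform grid."""
    L = np.exp(grid) - grid - 1.0      # toy loss (assumption), minimised at w = 0
    logw = -n * L
    wts = np.exp(logw - logw.max())
    return float(np.sum(grid * wts) / np.sum(wts))

n = 100
grid = np.linspace(-3.0, 3.0, 60001)
exact = posterior_mean(n, grid)

# R_1(f, w_hat) = (1/2) f'' J - (1/2) f' V with f'' = 0, f' = 1, J = 1, V = J*J*L''' = 1
r1 = 0.5 * 0.0 * 1.0 - 0.5 * 1.0 * 1.0
laplace = 0.0 + r1 / n                 # f(w_hat) + R_1 / n

print(exact, laplace)
```

For this toy loss the remaining gap between the two values is of order $1/n^2$, which is the size of the $R_2$ term in eq.(84).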
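(Proof of Lemma 2) Since $L(\hat{w})$ is a constant function of $w$, by using the regularity
condition (5),

$$\mathbb{E}_{\varphi_0}[f(w)] = \frac{\int f(w)\exp(-nL(w)+nL(\hat{w}))\,dw}{\int \exp(-nL(w)+nL(\hat{w}))\,dw}
= f(\hat{w}) + \frac{\int (f(w)-f(\hat{w}))\exp(-nL(w)+nL(\hat{w}))\,dw}{\int \exp(-nL(w)+nL(\hat{w}))\,dw}
= f(\hat{w}) + \frac{Z_1}{Z_0}\Big(1 + o_p\Big(\frac{1}{n^3}\Big)\Big), \tag{87}$$

where

$$Z_1 = \int_{W(\varepsilon)} (f(w)-f(\hat{w}))\exp(-nL(w)+nL(\hat{w}))\,dw, \tag{88}$$

$$Z_0 = \int_{W(\varepsilon)} \exp(-nL(w)+nL(\hat{w}))\,dw. \tag{89}$$

Note that $W(\varepsilon)$ is the integration region defined earlier in the paper. Let $u = \sqrt{n}(w-\hat{w})$. Then $du = n^{d/2}dw$
and the integrated region is $|u| < n^{\varepsilon}$. The Taylor expansions of $f(w)$ and $nL(w)$ around
$\hat{w}$ are respectively given by

$$f(w) - f(\hat{w}) = H_1(u),$$
$$n(L(w) - L(\hat{w})) = \frac{1}{2}L_{k_1k_2}u^{k_1}u^{k_2} + H_2(u),$$

where $H_1(u)$ and $H_2(u)$ are functions defined by

$$H_1(u) = \frac{1}{n^{1/2}}f_{k_1}u^{k_1} + \frac{1}{2n}f_{k_1k_2}u^{k_1}u^{k_2} + \frac{1}{6n^{3/2}}f_{k_1k_2k_3}u^{k_1}u^{k_2}u^{k_3} + \frac{1}{24n^{2}}f_{k_1k_2k_3k_4}u^{k_1}u^{k_2}u^{k_3}u^{k_4} + \frac{g_1(u)}{n^{5/2}}, \tag{90}$$

$$H_2(u) = \frac{1}{6n^{1/2}}L_{k_1k_2k_3}u^{k_1}u^{k_2}u^{k_3} + \frac{1}{24n}L_{k_1k_2k_3k_4}u^{k_1}u^{k_2}u^{k_3}u^{k_4} + \frac{1}{120n^{3/2}}L_{k_1k_2k_3k_4k_5}u^{k_1}u^{k_2}u^{k_3}u^{k_4}u^{k_5} + \frac{g_2(u)}{n^{2}}, \tag{91}$$

where $g_1(u)$ and $g_2(u)$ are constant order functions. In these equations, the derivatives of
$f$ and $L$ are defined by their values at $w = \hat{w}$. We use notations,

$$\rho(u) = \exp\Big(-\frac{1}{2}L_{k_1k_2}u^{k_1}u^{k_2}\Big), \tag{92}$$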
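and $C_0 = \int \rho(u)\,du$. Remark that $J^{k_1k_2}$ is the inverse matrix of $L_{k_1k_2}$, hence

$$\int u^{k_1}u^{k_2}\rho(u)\,du = C_0\,J^{k_1k_2}, \tag{94}$$

$$\int \prod_{i=1}^{4}u^{k_i}\rho(u)\,du = C_0\,(J^{k_1k_2}J^{k_3k_4} + J^{k_1k_3}J^{k_2k_4} + J^{k_1k_4}J^{k_2k_3}), \tag{95}$$

$$\int \prod_{i=1}^{6}u^{k_i}\rho(u)\,du = C_0\,\mathrm{Sym}(J)^{k_1k_2k_3k_4k_5k_6}, \tag{96}$$

$$\int \prod_{i=1}^{8}u^{k_i}\rho(u)\,du = C_0\,\mathrm{Sym}(J)^{k_1k_2k_3k_4k_5k_6k_7k_8}, \tag{97}$$

where

$$\mathrm{Sym}(J)^{k_1k_2k_3k_4k_5k_6} = \sum_{m_1,\dots,m_6} J^{m_1m_2}J^{m_3m_4}J^{m_5m_6}, \tag{98}$$

$$\mathrm{Sym}(J)^{k_1k_2k_3k_4k_5k_6k_7k_8} = \sum_{m_1,\dots,m_8} J^{m_1m_2}J^{m_3m_4}J^{m_5m_6}J^{m_7m_8}. \tag{99}$$

Here $\sum_{m_1,\dots,m_6}$ is the sum of all 15 different pair combinations of $(k_1,\dots,k_6)$ and $\sum_{m_1,\dots,m_8}$
is the sum of all 105 different pair combinations of $(k_1,\dots,k_8)$. By using these results, $Z_0$
in eq.(89) is given by

$$Z_0 = \frac{1}{n^{d/2}}\int_{|u|<n^{\varepsilon}}\exp(-H_2(u))\rho(u)\,du
= \frac{1}{n^{d/2}}\int_{|u|<n^{\varepsilon}}\Big\{1 - H_2(u) + \frac{H_2(u)^2}{2} - \frac{H_2(u)^3}{6} + O_p(n^{-2})\Big\}\rho(u)\,du. \tag{100}$$

Then by using the symmetry of the integrated region, the integrations of the odd order
terms are equal to zero. It follows that

$$Z_0 = \frac{C_0}{n^{d/2}}\Big(1 + \frac{Y_1(\hat{w})}{n} + O_p\Big(\frac{1}{n^2}\Big)\Big), \tag{101}$$

where

$$Y_1(\hat{w}) = \frac{1}{C_0}\int\Big\{-\frac{1}{24}L_{k_1k_2k_3k_4}u^{k_1}u^{k_2}u^{k_3}u^{k_4} + \frac{1}{72}(L_{k_1k_2k_3}u^{k_1}u^{k_2}u^{k_3})^2\Big\}\rho(u)\,du$$
$$= -\frac{1}{8}L_{k_1k_2k_3k_4}J^{k_1k_2}J^{k_3k_4} + \frac{1}{72}L_{k_1k_2k_3}L_{k_4k_5k_6}\mathrm{Sym}(J)^{k_1k_2k_3k_4k_5k_6}, \tag{102}$$

where we used eq.(95) and eq.(96). On the other hand, $Z_1$ in eq.(88) is given by

$$Z_1 = \frac{1}{n^{d/2}}\int_{|u|<n^{\varepsilon}}H_1(u)\exp(-H_2(u))\rho(u)\,du
= \frac{1}{n^{d/2}}\int_{|u|<n^{\varepsilon}}H_1(u)\Big\{1 - H_2(u) + \frac{H_2(u)^2}{2} - \frac{H_2(u)^3}{6} + O_p(n^{-2})\Big\}\rho(u)\,du. \tag{103}$$

Then by using symmetry of the integrated region, it follows that

$$Z_1 = \frac{C_0}{n^{d/2}}\Big(\frac{Y_2(\hat{w})}{n} + \frac{Y_3(\hat{w})}{n^2} + O_p\Big(\frac{1}{n^3}\Big)\Big), \tag{104}$$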
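where $Y_2(\hat{w})$ and $Y_3(\hat{w})$ are given by

$$Y_2(\hat{w}) = \frac{1}{C_0}\int\Big\{\frac{1}{2}f_{k_1k_2}u^{k_1}u^{k_2} - \frac{1}{6}f_{k_1}L_{k_2k_3k_4}u^{k_1}u^{k_2}u^{k_3}u^{k_4}\Big\}\rho(u)\,du
= \frac{1}{2}f_{k_1k_2}J^{k_1k_2} - \frac{1}{2}f_{k_1}L_{k_2k_3k_4}J^{k_1k_2}J^{k_3k_4}, \tag{105}$$

$$Y_3(\hat{w}) = \frac{1}{C_0}\int\Big\{\frac{1}{24}f_{k_1k_2k_3k_4}\prod_{i=1}^{4}u^{k_i}
- \Big(\frac{1}{120}f_{k_1}L_{k_2k_3k_4k_5k_6} + \frac{1}{48}f_{k_1k_2}L_{k_3k_4k_5k_6} + \frac{1}{36}f_{k_1k_2k_3}L_{k_4k_5k_6}\Big)\prod_{i=1}^{6}u^{k_i}$$
$$\qquad + \frac{1}{144}\big(f_{k_1k_2}L_{k_3k_4k_5}L_{k_6k_7k_8} + f_{k_1}L_{k_2k_3k_4}L_{k_5k_6k_7k_8}\big)\prod_{i=1}^{8}u^{k_i}\Big\}\rho(u)\,du$$
$$= \frac{1}{8}f_{k_1k_2k_3k_4}J^{k_1k_2}J^{k_3k_4}
- \Big(\frac{1}{120}f_{k_1}L_{k_2k_3k_4k_5k_6} + \frac{1}{48}f_{k_1k_2}L_{k_3k_4k_5k_6} + \frac{1}{36}f_{k_1k_2k_3}L_{k_4k_5k_6}\Big)\mathrm{Sym}(J)^{k_1k_2k_3k_4k_5k_6}$$
$$\qquad + \frac{1}{144}\big(f_{k_1k_2}L_{k_3k_4k_5}L_{k_6k_7k_8} + f_{k_1}L_{k_2k_3k_4}L_{k_5k_6k_7k_8}\big)\mathrm{Sym}(J)^{k_1k_2k_3k_4k_5k_6k_7k_8},$$

where we used eq.(95), eq.(96), and eq.(97). Summing up these results,

$$\mathbb{E}_{\varphi_0}[f(w)] = f(\hat{w}) + \frac{Y_2(\hat{w})/n + Y_3(\hat{w})/n^2 + O_p(1/n^3)}{1 + Y_1(\hat{w})/n + O_p(1/n^2)}\Big(1 + o_p\Big(\frac{1}{n^3}\Big)\Big)$$
$$= f(\hat{w}) + Y_2(\hat{w})/n + (Y_3(\hat{w}) - Y_1(\hat{w})Y_2(\hat{w}))/n^2 + O_p(1/n^3). \tag{106}$$

Therefore, by putting

$$R_1(f,w) = Y_2(w), \tag{107}$$

$$R_2(f,w) = Y_3(w) - Y_1(w)Y_2(w), \tag{108}$$

the first half of Lemma 2 is completed. The latter half is equal to the case when the training
samples are leaving $X_j$ out, hence it is immediately obtained from the first half.
(Q.E.D.)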
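The counts of pair combinations quoted in the proof (15 for six indices, 105 for eight indices) can be confirmed by direct enumeration. The helper `pairings` below is a hypothetical name introduced only for this sketch; it lists every way of splitting an index list into unordered pairs, which is the index set of the sums in eq.(98) and eq.(99).

```python
def pairings(items):
    """Yield every way to split `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

# Number of pair combinations of (k_1, ..., k_m) for m = 2, 4, 6, 8.
counts = {m: sum(1 for _ in pairings(list(range(1, m + 1)))) for m in (2, 4, 6, 8)}
print(counts)
```

The case of four indices gives the three terms on the right-hand side of eq.(95).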

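Lemma 3. Let $|\alpha| < 1$. If $m$ is a positive odd number,

$$\frac{\mathbb{E}_{\varphi_0}\big[p(X_k|w)^{\alpha}\prod_{j=1}^{m}(w^{k_j} - \hat{w}^{k_j})\big]}{\mathbb{E}_{\varphi_0}[p(X_k|w)^{\alpha}]} = O_p\Big(\frac{1}{n^{(m+1)/2}}\Big). \tag{109}$$

If $m$ is a positive even number,

$$\frac{\mathbb{E}_{\varphi_0}\big[p(X_k|w)^{\alpha}\prod_{j=1}^{m}(w^{k_j} - \hat{w}^{k_j})\big]}{\mathbb{E}_{\varphi_0}[p(X_k|w)^{\alpha}]} = O_p\Big(\frac{1}{n^{m/2}}\Big). \tag{110}$$

For $m = 2, 4$,

$$\mathbb{E}_{\varphi_0}\Big[\prod_{j=1}^{2}(w^{k_j} - \hat{w}^{k_j})\Big] = \frac{1}{n}J^{k_1k_2} + O_p\Big(\frac{1}{n^2}\Big), \tag{111}$$

$$\mathbb{E}_{\varphi_0}\Big[\prod_{j=1}^{4}(w^{k_j} - \hat{w}^{k_j})\Big] = \frac{1}{n^2}\big(J^{k_1k_2}J^{k_3k_4} + J^{k_1k_3}J^{k_2k_4} + J^{k_1k_4}J^{k_2k_3}\big) + O_p\Big(\frac{1}{n^3}\Big). \tag{112}$$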
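(Proof of Lemma 3) In this proof, we use the same notations as in the proof of Lemma 2. By
the regularity condition (5),

$$\frac{\mathbb{E}_{\varphi_0}\big[p(X_k|w)^{\alpha}\prod_{j=1}^{m}(w^{k_j} - \hat{w}^{k_j})\big]}{\mathbb{E}_{\varphi_0}[p(X_k|w)^{\alpha}]} = \frac{Z_m}{Z_0}\Big(1 + o_p\Big(\frac{1}{n^{\beta}}\Big)\Big)$$

for an arbitrary $\beta > 0$, where $Z_0$ here denotes the case $m = 0$ of

$$Z_m = \int_{W(\varepsilon)}\prod_{j=1}^{m}(w^{k_j} - \hat{w}^{k_j})\exp(-nL(w) + nL(\hat{w}) + \alpha\,\eta(X_k,w))\,dw$$
$$= \frac{1}{n^{(m+d)/2}}\int_{|u|<n^{\varepsilon}}\prod_{j=1}^{m}u^{k_j}\exp\big(-H_2(u) + \alpha\,\eta(X_k,\hat{w} + u/\sqrt{n})\big)\rho(u)\,du. \tag{113}$$

Here we used a notation, $\eta(X_k,w) = \log p(X_k|w) - \log p(X_k|\hat{w})$. By eq.(91), the first
term of $H_2(u)$ is in proportion to $1/\sqrt{n}$. Moreover, there exists $u^*$ such that
$\eta(X_k,\hat{w} + u/\sqrt{n}) = (u/\sqrt{n})\cdot\nabla\eta(X_k,u^*)$. Hence by the expansion of

$$\exp\big(-H_2(u) + \alpha\,u\cdot\nabla\eta(X_k,u^*)/\sqrt{n}\big) = 1 + \{-H_2(u) + \alpha\,u\cdot\nabla\eta(X_k,u^*)/\sqrt{n}\} + O_p(1/n),$$

if $m$ is an odd number, $Z_m = O_p(1/n^{(m+d+1)/2})$, which shows eq.(109), and if $m$ is an
even number, $Z_m = O_p(1/n^{(m+d)/2})$, which shows eq.(110). By using eq.(94) and eq.(95)
in the case $\alpha = 0$, the results for $m = 2, 4$ are derived. (Q.E.D.)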

Definition. We use several functions of w in the proof. 


S^^iw) 


(114) 

T^^{w) 


(115) 



(116) 


Lemma 4. Let $\hat{w}$ and $\hat{w}_j$ be the MAP estimators for $X^n$ and $X^n \setminus X_j$, respectively.
Then


iE{(%)‘‘ - ^rfc(w) + 0,(4), (117) 

i=i 

1 ^ 11 

-- »‘"} = ;pC'=‘‘>(*i) + o,(^). ( 118 ) 

i=i 

For an arbitrary C°°-class function f{w) 

1 1 1 

1=1 

+^/fcifc.(ih)H"i'=^(u;)} + Op(S). (119) 

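(Proof of Lemma 4) In this proof, we use a notation, $\ell(j,w) = \log p(X_j|w)$. Since $\hat{w}_j$
minimizes $L(w,-j)$, its derivative is equal to zero at $\hat{w}_j$,

$$L_{k_1}(\hat{w}_j,-j) = 0. \tag{120}$$

By using the mean value theorem, there exists a parameter $w^*$ which satisfies $\|w^* - \hat{w}\| \le \|\hat{w}_j - \hat{w}\|$ and

$$L_{k_1}(\hat{w},-j) + L_{k_1k_2}(w^*,-j)\big((\hat{w}_j)^{k_2} - \hat{w}^{k_2}\big) = 0. \tag{121}$$

Note that the MAP estimator $\hat{w}$ minimizes $L(w)$ where

$$L(w) = L(w,-j) - \frac{1}{n}\ell(j,w), \tag{122}$$

hence its derivative satisfies $L_{k_1}(\hat{w}) = 0$. Therefore

$$L_{k_1}(\hat{w},-j) = \frac{1}{n}\ell_{k_1}(j,\hat{w}). \tag{123}$$

By applying eq.(123) to eq.(121),

$$(\hat{w}_j)^{k_1} - \hat{w}^{k_1} = -\frac{1}{n}\big(L(w^*,-j)^{-1}\big)^{k_1k_2}\ell_{k_2}(j,\hat{w}). \tag{124}$$

Therefore $\hat{w}_j - \hat{w} = O_p(1/n)$, resulting that $w^* - \hat{w} = O_p(1/n)$. It follows that

$$(\hat{w}_j)^{k_1} - \hat{w}^{k_1} = -\frac{1}{n}\big(L(\hat{w})^{-1}\big)^{k_1k_2}\ell_{k_2}(j,\hat{w}) + O_p(1/n^2), \tag{125}$$

where we used $L(w,-j) = L(w) + O_p(1/n)$. Hence

$$\sum_{j=1}^{n}\big\{(\hat{w}_j)^{k_1} - \hat{w}^{k_1}\big\}\big\{(\hat{w}_j)^{k_2} - \hat{w}^{k_2}\big\}
= \frac{1}{n^2}\sum_{j=1}^{n}\big(L(\hat{w})^{-1}\big)^{k_1k_3}\big(L(\hat{w})^{-1}\big)^{k_2k_4}\ell_{k_3}(j,\hat{w})\ell_{k_4}(j,\hat{w}) + O_p(1/n^2)
= \frac{1}{n}U^{k_1k_2}(\hat{w}) + O_p(1/n^2), \tag{126}$$

which shows eq.(118) in Lemma 4.

Let us show eq.(117). From eq.(120), by using the
higher order mean value theorem, there exists a parameter $w^{**}$ which satisfies $\|w^{**} - \hat{w}\| \le \|\hat{w}_j - \hat{w}\|$ and

$$\frac{1}{n}\ell_{k_1}(j,\hat{w}) + L_{k_1k_2}(\hat{w},-j)\big((\hat{w}_j)^{k_2} - \hat{w}^{k_2}\big) + \frac{1}{2}L_{k_1k_2k_3}(w^{**},-j)\big((\hat{w}_j)^{k_2} - \hat{w}^{k_2}\big)\big((\hat{w}_j)^{k_3} - \hat{w}^{k_3}\big) = 0, \tag{127}$$

where we used eq.(123). The second term of eq.(127) is

$$L_{k_1k_2}(\hat{w},-j)\big((\hat{w}_j)^{k_2} - \hat{w}^{k_2}\big)
= L_{k_1k_2}(\hat{w})\big((\hat{w}_j)^{k_2} - \hat{w}^{k_2}\big) + \frac{1}{n}\ell_{k_1k_2}(j,\hat{w})\big((\hat{w}_j)^{k_2} - \hat{w}^{k_2}\big)$$
$$= L_{k_1k_2}(\hat{w})\big((\hat{w}_j)^{k_2} - \hat{w}^{k_2}\big) - \frac{1}{n^2}\ell_{k_1k_2}(j,\hat{w})\big(L(\hat{w})^{-1}\big)^{k_2k_3}\ell_{k_3}(j,\hat{w}) + O_p(1/n^3), \tag{128}$$

where we used eq.(122) and eq.(125). Also by eq.(125), the third term of eq.(127) is

$$\frac{1}{2}L_{k_1k_2k_3}(\hat{w})\big((\hat{w}_j)^{k_2} - \hat{w}^{k_2}\big)\big((\hat{w}_j)^{k_3} - \hat{w}^{k_3}\big) + O_p\Big(\frac{1}{n^3}\Big)
= \frac{1}{2n^2}L_{k_1k_2k_3}(\hat{w})J^{k_2k_4}J^{k_3k_5}\ell_{k_4}(j,\hat{w})\ell_{k_5}(j,\hat{w}) + O_p\Big(\frac{1}{n^3}\Big). \tag{129}$$

Then by applying eq.(128), eq.(129), $L_{k_1}(\hat{w}) = 0$, and $\big(L(\hat{w})^{-1}\big)^{k_1k_2} = J^{k_1k_2}$ to eq.(127), the
sum for $j = 1, 2, \dots, n$ of eq.(127) results in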

1 ” 1 

U.b (- 

i=i 

+ Op(^) = 0. (130) 

Therefore 

1 ” , 
n 

i=i 

+ Op(^), (131) 

which shows eq.(117) in Lemma 4. The last equation in Lemma 4, eq.(119), is proved by

- 71 1 ^ 

j=i j=i 

1 ” 1 
- *‘’‘){(»i)‘= - »‘’")A.te + 0,{^) 
i=i 

= ;)2A.T‘-r*‘/2) + 5^A,bt/*'‘“ + o,(T), (132) 

which completes Lemma 4. (Q.E.D.) 


6 Proof of Theorem 1 

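In this section, we prove the main theorems. The proof of Theorem 1 consists of five
parts: cross validation, WAIC, mathematical relations, averages, and random generalization
loss.


6.1 Proof of Theorem 1, Cross Validation

In this subsection, we prove eq.(39) in Theorem 1. By Lemma 2,

$$\mathbb{E}_{\varphi_0}[\phi(w)] = \phi(\hat{w})\Big(1 + \frac{R_1(\phi,\hat{w})}{\phi(\hat{w})\,n} + \frac{R_2(\phi,\hat{w})}{\phi(\hat{w})\,n^2}\Big) + O_p\Big(\frac{1}{n^3}\Big), \tag{133}$$

$$\mathbb{E}_{\varphi_0}^{(-j)}[\phi(w)] = \phi(\hat{w}_j)\Big(1 + \frac{R_1(\phi,\hat{w}_j)}{\phi(\hat{w}_j)(n-1)} + \frac{R_2(\phi,\hat{w}_j)}{\phi(\hat{w}_j)(n-1)^2}\Big) + O_p\Big(\frac{1}{n^3}\Big). \tag{134}$$

For an arbitrary $C^{\infty}$-class function $f(w)$, by Lemma 4,

$$\frac{1}{n}\sum_{j=1}^{n}\Big\{\frac{f(\hat{w}_j)}{n-1} - \frac{f(\hat{w})}{n}\Big\}
= \frac{1}{n}\sum_{j=1}^{n}\Big\{\frac{f(\hat{w}_j) - f(\hat{w})}{n-1} + \frac{f(\hat{w})}{n(n-1)}\Big\}
= \frac{f(\hat{w})}{n^2} + O_p\Big(\frac{1}{n^3}\Big), \tag{135}$$

$$\frac{1}{n}\sum_{j=1}^{n}\Big\{\frac{f(\hat{w}_j)}{(n-1)^2} - \frac{f(\hat{w})}{n^2}\Big\}
= \frac{1}{n}\sum_{j=1}^{n}\Big\{\frac{f(\hat{w}_j) - f(\hat{w})}{(n-1)^2} + \frac{(2n-1)f(\hat{w})}{n^2(n-1)^2}\Big\}
= O_p\Big(\frac{1}{n^3}\Big). \tag{136}$$

By using Lemma 1 and by applying these equations for $f(w) = R_1(\phi,w)/\phi(w)$, $R_2(\phi,w)/\phi(w)$,

$$\mathrm{CV}(\varphi) - \mathrm{CV}(\varphi_0) = \frac{1}{n}\sum_{j=1}^{n}\Big\{\log\mathbb{E}_{\varphi_0}^{(-j)}[\phi(w)] - \log\mathbb{E}_{\varphi_0}[\phi(w)]\Big\}$$
$$= \frac{1}{n}\sum_{j=1}^{n}\{\log\phi(\hat{w}_j) - \log\phi(\hat{w})\} + \frac{R_1(\phi,\hat{w})}{\phi(\hat{w})\,n^2} + O_p\Big(\frac{1}{n^3}\Big). \tag{137}$$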

By using Lemma 2 and 4, 


CV((/p) - CV(^o) = + 




Then by using

$$\phi_{k_1}/\phi = (\log\phi)_{k_1}, \qquad \phi_{k_1k_2}/\phi = (\log\phi)_{k_1k_2} + (\log\phi)_{k_1}(\log\phi)_{k_2}, \tag{138}$$


it follows that 


CV(¥P) - CV((^o) = ^{logcl2)kAS^^-^T’^^-W’^^) 




1 


+^(log(())fei(log(/>)fc 2 + Op(^), 


(139) 


which completes eq.(39). (Q.E.D.)


6.2 Proof of Theorem 1, WAIC 

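In this subsection we prove eq.(41) and eq.(43) in Theorem 1. In the following, we prove
eq.(43). In order to prove eq.(43), it is sufficient to prove eq.(43) in the case $\varphi(w) = \varphi_0(w)$
for an arbitrary $\varphi_0(w)$. Let the functional cumulant generating function for $\varphi_0(w)$ be

$$F_{cum}(\alpha) = \frac{1}{n}\sum_{i=1}^{n}\log\mathbb{E}_{\varphi_0}[p(X_i|w)^{\alpha}].$$

For a natural number $j$, we define the $j$th functional cumulant by

$$C_j(\alpha) = \frac{d^j}{d\alpha^j}F_{cum}(\alpha).$$

Then by definition, $F_{cum}(0) = 0$ and

$$\mathrm{CV}(\varphi_0) = F_{cum}(-1), \tag{140}$$

$$\mathrm{WAIC}(\varphi_0) = T(\varphi_0) + V(\varphi_0)/n = -F_{cum}(1) + C_2(0). \tag{141}$$

For a natural number $j$, let $m_j(X_i,\alpha)$ be

$$m_j(X_i,\alpha) = \frac{\mathbb{E}_{\varphi_0}[\eta(X_i,w)^j\exp(-\alpha\,\eta(X_i,w))]}{\mathbb{E}_{\varphi_0}[\exp(-\alpha\,\eta(X_i,w))]},$$

where $\eta(X_i,w) = \log p(X_i|\hat{w}) - \log p(X_i|w)$. Note that

$$\eta(X_i,w) = -(w^{k_1} - \hat{w}^{k_1})\ell_{k_1}(X_i,\hat{w}) + O_p((w - \hat{w})^2). \tag{142}$$

Therefore, if $j$ is an odd number, by using Lemma 3,

$$m_j(X_i,\alpha) = O_p\Big(\frac{1}{n^{(j+1)/2}}\Big), \tag{143}$$

or if $j$ is an even number

$$m_j(X_i,\alpha) = O_p\Big(\frac{1}{n^{j/2}}\Big). \tag{144}$$

Since $p(X_i|\hat{w})$ is a constant function of $w$,

$$C_6(\alpha) = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{\partial}{\partial\alpha}\Big)^6\log\mathbb{E}_{\varphi_0}[p(X_i|w)^{\alpha}]
= \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{\partial}{\partial\alpha}\Big)^6\log\mathbb{E}_{\varphi_0}[(p(X_i|w)/p(X_i|\hat{w}))^{\alpha}]
= \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{\partial}{\partial\alpha}\Big)^6\log\mathbb{E}_{\varphi_0}[\exp(-\alpha\,\eta(X_i,w))]$$
$$= \frac{1}{n}\sum_{i=1}^{n}\{m_6 - 6m_5m_1 - 15m_4m_2 + 30m_4m_1^2 - 10m_3^2 + 120m_3m_2m_1 - 120m_3m_1^3 + 30m_2^3 - 270m_2^2m_1^2 + 360m_2m_1^4 - 120m_1^6\} = O_p\Big(\frac{1}{n^3}\Big), \tag{145}$$

where $m_k = m_k(X_i,\alpha)$ in eq.(145). Hence by eq.(140),

$$\mathrm{CV}(\varphi_0) = \sum_{j=1}^{5}\frac{(-1)^j}{j!}C_j(0) + O_p\Big(\frac{1}{n^3}\Big). \tag{146}$$

On the other hand, by eq.(141),

$$\mathrm{WAIC}(\varphi_0) = \sum_{j=1}^{5}\frac{-1}{j!}C_j(0) + C_2(0) + O_p\Big(\frac{1}{n^3}\Big). \tag{147}$$

It follows that

$$\mathrm{WAIC}(\varphi_0) = \mathrm{CV}(\varphi_0) - \frac{1}{12}C_4(0) + O_p\Big(\frac{1}{n^3}\Big).$$

Hence the main difference between CV and WAIC is $C_4(0)/12$. In order to prove eq.(43),
it is sufficient to prove $C_4(0) = O_p(1/n^3)$.

$$C_4(0) = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{\partial}{\partial\alpha}\Big)^4\log\mathbb{E}_{\varphi_0}[(p(X_i|w)/p(X_i|\hat{w}))^{\alpha}]\Big|_{\alpha=0}
= \frac{1}{n}\sum_{i=1}^{n}\{m_4 - 4m_3m_1 - 3m_2^2 + 12m_2m_1^2 - 6m_1^4\}, \tag{148}$$

where $m_k = m_k(X_i,0)$ in eq.(148). By eq.(143) and eq.(144),

$$C_4(0) = \frac{1}{n}\sum_{i=1}^{n}\{m_4(X_i,0) - 3m_2(X_i,0)^2\} + O_p\Big(\frac{1}{n^3}\Big).$$

By using a notation

$$F_{k_1,k_2,k_3,k_4} = \frac{1}{n}\sum_{i=1}^{n}\ell_{k_1}(X_i)\ell_{k_2}(X_i)\ell_{k_3}(X_i)\ell_{k_4}(X_i),$$

it follows that by eq.(142) and Lemma 3,

$$\frac{1}{n}\sum_{i=1}^{n}m_4(X_i) = \frac{3}{n^2}J^{k_1k_2}J^{k_3k_4}F_{k_1,k_2,k_3,k_4} + O_p\Big(\frac{1}{n^3}\Big), \tag{149}$$

$$\frac{1}{n}\sum_{i=1}^{n}m_2(X_i)^2 = \frac{1}{n^2}J^{k_1k_2}J^{k_3k_4}F_{k_1,k_2,k_3,k_4} + O_p\Big(\frac{1}{n^3}\Big), \tag{150}$$

resulting that $C_4(0) = O_p(1/n^3)$, which completes eq.(43). Then, eq.(41) is immediately
derived using eq.(39) and eq.(43). (Q.E.D.)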
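Two arithmetic steps in this proof can be checked mechanically. The sketch below is an illustration only: it first compares the Taylor coefficients of $F_{cum}(-1)$ and $-F_{cum}(1) + C_2(0)$ up to fifth order with exact fractions, and then checks the raw-moment form of the fourth cumulant in eq.(148) against the central-moment form $\mu_4 - 3\mu_2^2$ on an arbitrary three-point distribution (an assumption chosen for the example).

```python
from fractions import Fraction
from math import factorial

import numpy as np

# Taylor coefficients of C_j(0) in CV = F_cum(-1) and WAIC = -F_cum(1) + C_2(0).
cv_coef = {j: Fraction((-1) ** j, factorial(j)) for j in range(1, 6)}
waic_coef = {j: Fraction(-1, factorial(j)) + (1 if j == 2 else 0) for j in range(1, 6)}
diff = {j: waic_coef[j] - cv_coef[j] for j in range(1, 6)}
print(diff)   # only the j = 4 coefficient is nonzero

# Fourth cumulant from raw moments, as in eq.(148), versus central moments.
values = np.array([0.0, 1.0, 3.0])      # toy support (assumption)
probs = np.array([0.5, 0.3, 0.2])       # toy probabilities (assumption)
m = [float(np.sum(probs * values ** k)) for k in range(5)]   # m[k] = k-th raw moment
c4_raw = m[4] - 4 * m[3] * m[1] - 3 * m[2] ** 2 + 12 * m[2] * m[1] ** 2 - 6 * m[1] ** 4
centered = values - m[1]
mu2 = float(np.sum(probs * centered ** 2))
mu4 = float(np.sum(probs * centered ** 4))
c4_central = mu4 - 3 * mu2 ** 2
print(c4_raw, c4_central)
```

The first check reproduces the statement that the two criteria differ only through $C_4(0)/12$ up to fifth order.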

6.3 Mathematical Relations between Priors 

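In this subsection, we prove eq.(44), eq.(45), and eq.(46). Since $\hat{w}$ minimizes $L(w)$,

$$L_{k_1}(\hat{w}) = 0.$$

There exists $w^*$ such that $\|w^* - w_0\| \le \|\hat{w} - w_0\|$ and that

$$L_{k_1}(w_0) + L_{k_1k_2}(w^*)(\hat{w} - w_0)^{k_2} = 0.$$

By the regularity condition (3), $\hat{w} \to w_0$ resulting that $w^* \to w_0$. By using the central
limit theorem,

$$\hat{w} - w_0 = -(L_{k_1k_2}(w^*))^{-1}L_{k_1}(w_0) = O_p\Big(\frac{1}{\sqrt{n}}\Big). \tag{151}$$

Also by the central limit theorem, for an arbitrary $w \in W$,

$$L_{k_1k_2}(w) = \mathbb{E}[L_{k_1k_2}(w)] + \frac{\beta_{k_1k_2}}{\sqrt{n}}, \tag{152}$$

$$L_{k_1k_2k_3}(w) = \mathbb{E}[L_{k_1k_2k_3}(w)] + \frac{\beta_{k_1k_2k_3}}{\sqrt{n}}, \tag{153}$$

$$F_{k_1,k_2}(w) = \mathbb{E}[F_{k_1,k_2}(w)] + \frac{\gamma_{k_1k_2}}{\sqrt{n}}, \tag{154}$$

$$F_{k_1k_2,k_3}(w) = \mathbb{E}[F_{k_1k_2,k_3}(w)] + \frac{\gamma_{k_1k_2k_3}}{\sqrt{n}}, \tag{155}$$

where $\beta_{k_1k_2}$, $\beta_{k_1k_2k_3}$, $\gamma_{k_1k_2}$ and $\gamma_{k_1k_2k_3}$ are constant order random variables, whose
expectation values are equal to zero. By the definitions,

$$\mathcal{L}_{k_1k_2}(w) = \mathbb{E}[L_{k_1k_2}(w)] + O\Big(\frac{1}{n}\Big), \tag{156}$$

$$\mathcal{L}_{k_1k_2k_3}(w) = \mathbb{E}[L_{k_1k_2k_3}(w)] + O\Big(\frac{1}{n}\Big), \tag{157}$$

$$\mathcal{F}_{k_1,k_2}(w) = \mathbb{E}[F_{k_1,k_2}(w)] + O\Big(\frac{1}{n}\Big), \tag{158}$$

$$\mathcal{F}_{k_1k_2,k_3}(w) = \mathbb{E}[F_{k_1k_2,k_3}(w)] + O\Big(\frac{1}{n}\Big). \tag{159}$$

Let $\beta = \{\beta_{k_1k_2}\}$ and $\Lambda = \{L_{k_1k_2}(w)\}$. Then by eq.(152) and eq.(156),

$$\mathcal{J}(w) = \mathbb{E}[\Lambda]^{-1} + O\Big(\frac{1}{n}\Big)
= (\Lambda - \beta/\sqrt{n})^{-1} + O\Big(\frac{1}{n}\Big)
= (\Lambda(1 - \Lambda^{-1}\beta/\sqrt{n}))^{-1} + O\Big(\frac{1}{n}\Big)$$
$$= (1 + \Lambda^{-1}\beta/\sqrt{n})\Lambda^{-1} + O_p\Big(\frac{1}{n}\Big)
= J(w) + \Lambda^{-1}\beta\Lambda^{-1}/\sqrt{n} + O_p\Big(\frac{1}{n}\Big).$$

Hence

$$\mathbb{E}[J^{k_1k_2}(w)] = \mathcal{J}^{k_1k_2}(w) + O(1/n).$$

It follows that

$$M(\phi,w_0) = \mathcal{M}(\phi,w_0) + O_p(1/\sqrt{n}),$$
$$\mathbb{E}[M(\phi,w_0)] = \mathcal{M}(\phi,w_0) + O(1/n).$$

Hence

$$M(\phi,\hat{w}) = M(\phi,w_0) + (\hat{w} - w_0)^{k_1}(M(\phi,w_0))_{k_1} = \mathcal{M}(\phi,w_0) + O_p\Big(\frac{1}{\sqrt{n}}\Big),$$
$$\mathbb{E}[M(\phi,\hat{w})] = \mathcal{M}(\phi,w_0) + O\Big(\frac{1}{n}\Big),$$

which shows eq.(44) and eq.(46). Then eq.(45) is immediately derived by the fact
$\hat{w} - \mathbb{E}_{\varphi}[w] = O_p(1/n)$ by Lemma 3. (Q.E.D.)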
6.4 Proof of Theorem 1, Averages

In this subsection, we show eq.(40), eq.(42), and eq.(48).

Firstly, eq.(40) is derived from eq.(39) and eq.(46). Secondly, eq.(42) is derived from
eq.(41) and eq.(46). Lastly, let us prove eq.(48). Let $\mathrm{CV}_n(\varphi)$ and $G_n(\varphi)$ be the cross
validation and the generalization losses for $X^n$, respectively. Then by the definition, for
an arbitrary $\varphi$,

$$\mathbb{E}[G_n(\varphi)] = \mathbb{E}[\mathrm{CV}_{n+1}(\varphi)]
= \mathbb{E}[\mathrm{CV}_{n+1}(\varphi_0)] + \frac{\mathbb{E}[M(\phi,\hat{w})]}{(n+1)^2} + O\Big(\frac{1}{n^3}\Big)
= \mathbb{E}[G_n(\varphi_0)] + \frac{\mathbb{E}[M(\phi,\hat{w})]}{n^2} + O\Big(\frac{1}{n^3}\Big), \tag{160}$$

where we used $1/n^2 - 1/(n+1)^2 = O(1/n^3)$, which completes eq.(48). (Q.E.D.)


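6.5 Proof of Theorem 1, Random Generalization Loss

In this subsection, we prove eq.(47) in Theorem 1. We use a notation $\ell(n+1,w) = \log p(X_{n+1}|w)$. Let $\bar{w}$ be the parameter that minimizes

$$-\frac{1}{n+1}\sum_{i=1}^{n+1}\log p(X_i|w) - \frac{1}{n+1}\log\varphi_0(w) = \frac{n}{n+1}\Big\{L(w) - \frac{1}{n}\ell(n+1,w)\Big\}.$$

Since $\bar{w}$ minimizes $L(w) - \ell(n+1,w)/n$,

$$L_{k_1}(\bar{w}) - \frac{1}{n}\ell_{k_1}(n+1,\bar{w}) = 0. \tag{161}$$

By applying the mean value theorem to eq.(161), there exists $w^*$ such that

$$L_{k_1}(\hat{w}) + (\bar{w}^{k_2} - \hat{w}^{k_2})L_{k_1k_2}(w^*) - \frac{1}{n}\ell_{k_1}(n+1,\bar{w}) = 0.$$

By using $L_{k_1}(\hat{w}) = 0$ and positive definiteness of $L_{k_1k_2}(\hat{w})$,

$$\bar{w}^{k_2} - \hat{w}^{k_2} = O_p\Big(\frac{1}{n}\Big). \tag{162}$$

By applying the higher order mean value theorem to eq.(161), there exists $w^{**}$ such that

$$(\bar{w}^{k_2} - \hat{w}^{k_2})L_{k_1k_2}(\hat{w}) + \frac{1}{2}(\bar{w}^{k_2} - \hat{w}^{k_2})(\bar{w}^{k_3} - \hat{w}^{k_3})L_{k_1k_2k_3}(w^{**}) - \frac{1}{n}\ell_{k_1}(n+1,\bar{w}) = 0.$$

By eq.(162), the second term of this equation is $O_p(1/n^2)$. The inverse matrix of $L_{k_1k_2}(\hat{w})$
is $J^{k_1k_2}(\hat{w})$. Hence, replacing $\ell_{k_2}(n+1,\bar{w})$ by $\ell_{k_2}(n+1,\hat{w})$ at the cost of an $O_p(1/n^2)$ term,

$$\bar{w}^{k_1} - \hat{w}^{k_1} = \frac{1}{n}J^{k_1k_2}(\hat{w})\,\ell_{k_2}(n+1,\hat{w}) + O_p\Big(\frac{1}{n^2}\Big). \tag{163}$$

By eq.(151), $\hat{w} - w_0 = O_p(1/\sqrt{n})$. Hence by the expansion in $(\hat{w} - w_0)$,

$$\mathbb{E}_{X_{n+1}}[\ell_{k_2}(n+1,\hat{w})] = \mathbb{E}_{X_{n+1}}[(\log p(X_{n+1}|\hat{w}))_{k_2}]
= (\hat{w}^{k_3} - (w_0)^{k_3})\,\mathbb{E}_{X_{n+1}}[(\log p(X_{n+1}|w_0))_{k_2k_3}] + O_p\Big(\frac{1}{n}\Big), \tag{164}$$

where we used $\mathbb{E}_{X_{n+1}}[(\log p(X_{n+1}|w_0))_{k_2}] = 0$. By eq.(163) and eq.(164),

$$\mathbb{E}_{X_{n+1}}[\bar{w}^{k_1} - \hat{w}^{k_1}] = -\frac{1}{n}(\hat{w}^{k_1} - (w_0)^{k_1}) + O_p\Big(\frac{1}{n^2}\Big),$$

where we used $J^{k_2k_3}(\hat{w}) = (\mathbb{E}[L_{k_2k_3}(w_0)])^{-1} + O_p(1/n^{1/2})$. Therefore, by Lemma 1 and 2,

$$G(\varphi) - G(\varphi_0) = -\mathbb{E}_{X_{n+1}}\Big[\log\mathbb{E}_{\varphi_0}^{(+(n+1))}[\phi(w)] - \log\mathbb{E}_{\varphi_0}[\phi(w)]\Big]$$
$$= -\mathbb{E}_{X_{n+1}}\Big[\log\Big(\phi(\bar{w}) + \frac{R_1(\phi,\bar{w})}{n+1}\Big) - \log\Big(\phi(\hat{w}) + \frac{R_1(\phi,\hat{w})}{n}\Big)\Big] + O_p\Big(\frac{1}{n^2}\Big)$$
$$= -\mathbb{E}_{X_{n+1}}[\bar{w}^{k_1} - \hat{w}^{k_1}]\,(\log\phi)_{k_1}(\hat{w}) + O_p\Big(\frac{1}{n^2}\Big)$$
$$= \frac{1}{n}(\hat{w}^{k_1} - (w_0)^{k_1})(\log\phi)_{k_1}(\hat{w}) + O_p\Big(\frac{1}{n^2}\Big). \tag{165}$$

By eq.(151), $\hat{w} - w_0 = O_p(1/\sqrt{n})$, we obtain eq.(47). (Q.E.D.)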


6.6 Proof of Theorem 2 

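If there exists a parameter which satisfies $q(x) = p(x|w_0)$, then $\hat{w} - w_0 = O_p(1/\sqrt{n})$,
$(L_{k_1k_2})(\hat{w}) = \mathcal{L}_{k_1k_2}(\hat{w}) + O_p(1/\sqrt{n})$, $(L_{k_1k_2k_3})(\hat{w}) = \mathcal{L}_{k_1k_2k_3}(\hat{w}) + O_p(1/\sqrt{n})$,
$(F_{k_1,k_2})(\hat{w}) = \mathcal{F}_{k_1,k_2}(\hat{w}) + O_p(1/\sqrt{n})$, and $(F_{k_1k_2,k_3})(\hat{w}) = \mathcal{F}_{k_1k_2,k_3}(\hat{w}) + O_p(1/\sqrt{n})$. Hence Theorem 2 is
obtained. (Q.E.D.)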


7 Discussions 

In this section, we discuss several points about predictive prior design.

7.1 Summary of Results 

In this paper, we have shown the mathematical properties of Bayesian CV, WAIC, and 
the generalization loss. Let us summarize the results of this paper. 

(1) Even if the posterior distribution is not normal or even if the true distribution is
unrealizable by a statistical model, CV and WAIC are applicable to predictive prior design.
Theoretically CV and WAIC are asymptotically equivalent, whereas experimentally the
variance of WAIC is a little smaller than that of CV. If the regularity conditions are satisfied,
then CV and WAIC can be approximated by WAICR. The variance of WAICR is a little
smaller than those of CV and WAIC.

(2) If the true distribution is realizable by a statistical model, then CV and WAIC can be
estimated by WAICRS. The variance of WAICRS is much smaller than that of WAICR.

(3) If the posterior distribution is rigorously normal and if the true distribution is realizable 
by a statistical model, then DIC is almost equal to WAICRS. If otherwise, then DIC is 
different from CV, WAIC, or WAICRS and the chosen hyperparameter by DIC is not 
optimal for predictive prior design in general. 

(4) The marginal likelihood is not appropriate for predictive prior design. 
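The asymptotic equivalence of CV and WAIC in item (1) can be seen on a small example. The following sketch is an assumption-laden illustration, not the paper's experiment: it uses the model $p(x|w) = N(x; w, 1)$ with a normal prior of precision $\lambda = 2$, evaluates the posterior on a grid, and computes leave-one-out CV through the identity $\mathbb{E}^{(-i)}[p(X_i|w)] = 1/\mathbb{E}[1/p(X_i|w)]$ together with WAIC as training loss plus functional variance over $n$.

```python
import numpy as np

def lse(a, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=50)                 # toy sample (assumption)
w = np.linspace(-4.0, 4.0, 8001)                  # parameter grid (assumption)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - w[None, :]) ** 2
log_prior = -0.5 * 2.0 * w ** 2                   # unnormalised N(0, 1/lambda), lambda = 2
log_post = log_prior + loglik.sum(axis=0)
post = np.exp(log_post - lse(log_post))           # normalised posterior weights

# WAIC = T + V / n: training loss plus functional variance divided by n.
T = -np.mean(lse(log_post + loglik, axis=1) - lse(log_post))
mean_ll = loglik @ post
V = np.sum((loglik ** 2) @ post - mean_ll ** 2)
waic = T + V / len(x)

# Leave-one-out CV: -log E^{(-i)}[p(X_i|w)] = log E[1 / p(X_i|w)].
cv = np.mean(lse(log_post - loglik, axis=1) - lse(log_post))

print(cv, waic, abs(cv - waic))
```

In this conjugate toy case the gap between the two values is far smaller than either value, in line with the higher order equivalence stated in Theorem 1.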
7.2 Divergence Phenomenon of CV and WAIC

In this subsection we study a divergence phenomenon of CV, WAIC, and the marginal
likelihood. Let the maximum likelihood estimator be $\hat{w}_{mle}$ and $\delta_{mle}(w) = \delta(w - \hat{w}_{mle})$.
Then for an arbitrary proper prior $\varphi(w)$, $\mathrm{CV}(\varphi) \ge \mathrm{CV}(\delta_{mle})$, $\mathrm{WAIC}(\varphi) \ge \mathrm{WAIC}(\delta_{mle})$,
and $F_{free}(\varphi) \ge F_{free}(\delta_{mle})$. Hence, if a candidate prior can be made to converge to
$\delta_{mle}(w)$, then minimizing these criteria results in the maximum likelihood method, where
Theorem 1 does not hold.

Assume that a proper prior $\varphi(w) = \varphi(w|\lambda)$ has a hyperparameter $\lambda$. The set of
divergent parameters of $\varphi(w|\lambda)$ is defined by

$$W_{div}(\varphi) = \{w^* \in W \; ; \; \text{there exists } \{\lambda_k\} \text{ s.t. } \lim_{k\to\infty}\varphi(w|\lambda_k) = \delta(w - w^*)\}.$$

For example, if $\varphi(w|\lambda) = \sqrt{\lambda/2\pi}\exp(-\lambda w^2/2)$, then $W_{div}(\varphi) = \{0\}$, because $\lambda_k = k$ gives
a sequence of priors which converges to the delta function.

If the optimal parameter $w_0$ that minimizes the average generalization loss is contained
in $W_{div}(\varphi)$, then the optimal hyperparameter does not remain in a compact set as $n \to \infty$.
For example, the optimal hyperparameter for WAICRS in eq.(80) diverges if $\hat{w} \to 0$. In
such cases, CV or WAIC may not have any minimum point, and then the hyperparameter
cannot be optimized by using CV or WAIC. In such cases, some hyperprior or regularization
term is necessary. It is an important study to clarify the set of divergent parameters
of a given prior. For example, if the Dirichlet distribution on $a \in (0,1)$,

$$\mathrm{Dir}(a|\lambda_1,\lambda_2) = C(\lambda_1,\lambda_2)\,a^{\lambda_1-1}(1-a)^{\lambda_2-1},$$

is used as a prior, then an arbitrary parameter in $(0,1)$ is contained in $W_{div}(\mathrm{Dir})$, because
$\mathrm{Dir}(a|b_0k, c_0k) \to \delta(a - b_0/(b_0 + c_0))$ $(k \to \infty)$.
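The normal-prior example above can be made concrete with a few numbers. The sketch below is illustrative only; the interval half-width `eps` and the chosen precisions are assumptions. It computes the prior mass that $N(0, 1/\lambda)$ places on $[-\varepsilon, \varepsilon]$, which equals $\mathrm{erf}(\varepsilon\sqrt{\lambda/2})$ and tends to one as $\lambda \to \infty$, so that the prior approaches the delta function at the divergent parameter $0$.

```python
import math

def mass_near_zero(lam, eps):
    """Mass that the prior N(0, 1/lam) assigns to the interval [-eps, eps]."""
    return math.erf(eps * math.sqrt(lam / 2.0))

eps = 0.1
lams = [1.0, 100.0, 10000.0]
masses = [mass_near_zero(lam, eps) for lam in lams]
for lam, mass in zip(lams, masses):
    print(f"lambda = {lam:8.0f}  mass in [-{eps}, {eps}] = {mass:.6f}")
```

If a criterion keeps improving along such a sequence, its minimizer over $\lambda$ escapes every compact set, which is the divergence described in this subsection.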

7.3 Training and Testing Sets 

In practical applications of machine learning, we often prepare both a set of training
samples $X^n$ and a set of test samples $Y^m$, where $X^n$ and $Y^m$ are independent. This method
is sometimes called the holdout cross validation. Then we have a basic question, “Does
the optimal hyperparameter chosen by CV or WAIC using a training set $X^n$ minimize
the generalization loss estimated using a test set $Y^m$?”. The theoretical answer to this
question is No, because, as is shown in Theorem 1, the optimal hyperparameter for $X^n$
asymptotically minimizes $\mathbb{E}[G(X^n)]$ but not $G(X^n)$. If one wants to find the hyperparameter
which minimizes $G(X^n)$, then neither CV, WAIC, DIC, nor the marginal likelihood is
appropriate. On the other hand, if one wants to measure the optimality of the chosen
hyperparameter by the criterion $\mathbb{E}[G(X^n)]$, then CV or WAIC using $Y^m$ is useful, because
the optimal hyperparameter using $X^n$ is asymptotically equal to that using $Y^m$.

7.4 Self-Averaging and Bootstrap 

In this paper, we have shown that the optimal hyperparameter for the minimum average
generalization loss can be found by CV and WAIC asymptotically. However, the variance
of the estimated hyperparameter is sometimes not small. In experiments, WAICRS has
much smaller variances than CV and WAIC. Although WAICRS can be used only in the
case when the true distribution is realizable by a statistical model, it may be useful because of its
small variance.

If a statistical model $p(x|w)$ is complicated and if it is difficult to derive the
mathematical form of WAICRS, then it can be estimated numerically by

$$\mathrm{WAICRS} \approx \mathbb{E}_{Y^n}[\mathrm{WAIC}(Y^n,\varphi) - \mathrm{WAIC}(Y^n,\varphi_0)],$$

where $Y^n$ is taken from the Bayesian predictive distribution $p^*(x)$ and $\mathrm{WAIC}(Y^n,\varphi)$ is
WAIC for a set $Y^n$. Moreover, if $Y^n$ is taken from the empirical distribution $(1/n)\sum_{i}\delta(x - X_i)$,
then the above equation approximates WAICR. These numerical methods may need
heavy computational costs; however, they may be useful if precise hyperparameter
optimization is necessary.

8 Conclusion 

In this paper, we studied several methods for designing the hyperparameter from the
predictive point of view. The mathematical relation between priors gives an explicit criterion
for the prior, and its variance is made smaller by using self-averaging. Generalizing
the theory of this paper to singular statistical models is an important problem for
future study.